## Unit 9 Assignment - W200 Introduction to Data Science Programming, UC Berkeley MIDS

Write code in this Jupyter Notebook to solve the following problems. Please upload this **Notebook** with your solutions to your GitHub repository in your SUBMISSIONS/week_10 folder by 11:59PM PST the night before class. Do not upload the data files or the answer .csv files (we want your notebook to generate the answers when we run it).

This homework assignment is assigned during Week 10 but corresponds to the Unit #9 async.

## Objectives

- Demonstrate how to import different data files
- Get a small glimpse of how messy data can be
- Design and implement an algorithm to standardize the information and fix the messiness
- Work with Python data structures to sort and output the correct information
- Demonstrate how to export required information to a .csv file

## Reading and Writing Data (25 Points)

In this assignment, you will be reading and writing data. Yes, finally some data science (or at least some exploratory data analysis)! In the week_10 assignment folder, there are three data files named: 

* data.csv
* data.json
* data.pkl

These are three common file formats. You can run the following **on the bash command line** to see what is in each file (this will not work from a Windows prompt but will work in git bash):

```sh
head data.csv
head data.pkl
head data.json
```
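If `head` is not available (for example at a plain Windows prompt), a rough Python equivalent can be sketched as below. The helper name `preview` and its `n` parameter are hypothetical, and the demo runs on a throwaway file rather than the assignment data. The file is opened in binary mode so that a pickle file, which is not text, can be shown without a decoding error.

```python
import os
import tempfile
from itertools import islice


def preview(path, n=10):
    """Return the first n lines of a file as raw bytes, similar to `head`."""
    # Binary mode works for text (csv, json) and non-text (pickle) files alike.
    with open(path, "rb") as infile:
        return list(islice(infile, n))


# Demonstrate on a throwaway file so the sketch runs without the data files.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(b"Name,Country\nAda,Spain\nBob,Peru\n")

for line in preview(tmp.name, n=2):
    print(line)

os.remove(tmp.name)
```

Calling `preview("data.csv")` in the assignment folder would list the first ten lines of that file.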

You'll see that there is some method to the madness but that each file format has its peculiarities. Each file contains a portion of the total dataset that altogether comprises 100 records, so you need to **read in all of the files and combine them into some standard format** with which you are comfortable. Aim for something standard where each "row" is represented in the same format. **Name this object that contains the data for all three files combined ```full_data```**
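One possible standard format is a list of dictionaries, one per person. The sketch below assumes, purely for illustration, that one source yields row dictionaries and another yields a column-oriented mapping of the form `{column: {row_index: value}}`. The sample values and the helper `columns_to_rows` are hypothetical and are not taken from the data files.

```python
def columns_to_rows(columns):
    """Turn {column: {row_index: value}} into a list of row dictionaries."""
    # Assumes every column holds the same set of row indices.
    first_column = next(iter(columns.values()))
    return [
        {name: values[index] for name, values in columns.items()}
        for index in first_column
    ]


# Hypothetical sample data standing in for two differently shaped sources.
row_source = [{"Name": "Ada Lovelace", "Country": "Spain"}]
column_source = {
    "Name": {"0": "Bob Stone", "1": "Cy Young"},
    "Country": {"0": "Peru", "1": "Chile"},
}

combined = row_source + columns_to_rows(column_source)
print(len(combined))   # 3
print(combined[1])     # {'Name': 'Bob Stone', 'Country': 'Peru'}
```

Keeping each person as one record makes it harder for the fields to drift out of step when the data is sorted or filtered; keeping one list per column is equally valid as long as the lists stay aligned.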

### Questions to answer (75 points: each question is worth 15 points):
After you've standardized all of the data, report the following information: 

1. What are the unique countries in the dataset, sorted alphabetically?  Write to a new file called question_1.csv.
2. What are the unique complete email domains in the dataset, sorted alphabetically?  Write to a new file called question_2.csv. 
3. What are the first names of everyone (including duplicates) that do not have a P.O. Box address, sorted alphabetically?  Write to a new file called question_3.csv.
4. What are the full names of the first 5 people when you sort the data alphabetically by country?  Write to a new file called question_4.csv.
5. What are the full names of the first 5 people when you sort the data numerically ascending by phone number?  Write to a new file called question_5.csv.

We will be using a script to examine and grade your .csv files so please make sure: 
- The answers are all in one **column** with one list item per cell, sorted as stated in the question. I.e., looking at the .csv in a spreadsheet editor like Google Sheets, all answers would be in the 'A' column, with the first entry in A1, the second in A2, etc.
- Please do not include a header; just the answers to the questions.
- It is strongly recommended that you open each .csv file to ensure the answers are there and displayed correctly! 
- Don't include quotes around the list items.  I.e., strip the leading and trailing quotes, if necessary, from items when you write to the .csv files.  For example, a list entry should look like ```Spain``` rather than ```"Spain"```. One exception: Some country names do contain commas and it is ok to have quotes: ```""``` around just those country names so that they will be in one cell in the .csv. 
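
The output format described above can be sketched with `csv.writer`, which by default quotes only the fields that need it (such as a name containing a comma). The answers below are made up, and an in-memory buffer stands in for the real output file.

```python
import csv
import io

answers = ["Spain", "Korea, Republic of", "Peru"]  # hypothetical answers

buffer = io.StringIO()
writer = csv.writer(buffer)      # default quoting is csv.QUOTE_MINIMAL
for item in answers:
    writer.writerow([item])      # one-element list -> one cell in column A

print(buffer.getvalue())
```

When writing to disk, opening the file with `open(path, "w", newline="")` inside a `with` block ensures the rows are flushed and the file is closed before anyone inspects it.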


In addition, show all of your work in this **Jupyter notebook**.

### Assumptions

- You might have to make decisions about the data. For example, what to do with ties or how to sort the phone numbers numerically. 
- Write your assumptions in this Jupyter notebook at the top of your code under the heading below that says ASSUMPTIONS
- Please do some research before making an assumption (e.g. what is a domain name?); put your notes inside that assumption so we can understand your thought process. 
  - NOTE: If you don't know what an email domain is - do some research and write what you found in your assumptions; there is a correct answer to this question! 
- This is a good habit to do as you analyze data so that you can remember why you made the decisions you did and other people can follow your analysis later!
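
To see why the phone number decision matters, the sketch below compares a plain string sort with a numeric sort on a few made-up phone numbers. Treating a phone number as the integer formed by its digits is only one possible assumption, and the helper `digits_as_int` is hypothetical.

```python
import re


def digits_as_int(phone):
    """Drop every non-digit character and read the rest as an integer."""
    digits = re.sub(r"\D", "", phone)
    return int(digits) if digits else 0


phones = ["(555) 010-9999", "555-0100", "1-555-010-0000"]  # made-up values

by_text = sorted(phones)                       # character order: '(' < '1' < '5'
by_number = sorted(phones, key=digits_as_int)  # order by digit value

print(by_text)    # ['(555) 010-9999', '1-555-010-0000', '555-0100']
print(by_number)  # ['555-0100', '(555) 010-9999', '1-555-010-0000']
```

The two orderings differ, which is exactly the kind of choice worth recording under ASSUMPTIONS.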

### Restrictions
You should use these standard library imports:

```python
import json
import csv
import pickle
```

Some of you may be familiar with a Python package called `pandas` which would greatly speed up this sort of file processing.  The point of this homework is to do the work manually.  You can use `pandas` to independently check your work if you are so inclined but do not use `pandas` as the sole solution method. Don't worry if you are not familiar with `pandas`.  We will do this homework as a class exercise using `pandas` in the near future.

### Hints (optional)

- You may use regular expressions if you wish to extract data from each row. You do not need to use them if you do not want to or see a need to. The Python regular expression module is called `re`.
- You may want to use the `operator` module or the `sorted` function to help in sorting.
- There are many data structures and formats that you might use to solve this problem.  You will have to decide if you want to keep the information for each person together as one record or all the information for each of the fields together.
- You can put these files into sensible structures such as lists or dictionaries. The async covers how to do this for csv and json. For pickle, this might help: https://wiki.python.org/moin/UsingPickle
- `.items()` or `.keys()` can be useful for dictionaries
- Once again, it is strongly recommended that you open each .csv file to ensure the answers are there and displayed correctly! 
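
A few of these hints can be sketched together. The dictionary below is made-up sample data, and the pickle round trip happens in memory only to show the `dumps`/`loads` pair; reading a real file would use `pickle.load` on a file opened with `"rb"`.

```python
import pickle
from operator import itemgetter

people = {"Cy Young": "Peru", "Ada Lovelace": "Spain", "Bob Stone": "Chile"}

# .items() yields (key, value) pairs; itemgetter(1) sorts them by the value.
by_country = sorted(people.items(), key=itemgetter(1))
print(by_country[0])          # ('Bob Stone', 'Chile')
print(sorted(people.keys()))  # ['Ada Lovelace', 'Bob Stone', 'Cy Young']

# pickle round trip in memory: dumps() serializes, loads() restores the object.
restored = pickle.loads(pickle.dumps(people))
print(restored == people)     # True
```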

In [2]:
# Your name here

### ASSUMPTIONS:
# Please write the assumptions here that you made during your data analysis
# Please keep this code at the very top of your code block so we can easily see it while grading!

# I have not included data.csv, data.json, or data.pkl in my submission, following the instruction to "not upload the data files".
# For question 1 - I have assumed that "unique" means each country name appears only once, so duplicates are removed. I have also assumed that the countries in question_1.csv need to be in A-Z order, which I get from the sorted function. So that points are not taken off, I also want to mention that the length of an entry does not affect the ordering: entries are compared character by character starting from the first letter.
# For question 2 - I have assumed that "unique" means duplicates are removed and that the domain is the part of a complete email address after the "@". An address counts as complete when it has letters/numbers (optionally separated by periods) before the "@" and letters/numbers (optionally separated by periods) after the "@"; I check this format with a regular expression in the Question 2 code. I have also assumed that alphabetical ordering follows the sorted function, which places entries that start with an uppercase letter before entries that start with a lowercase letter. As in question 1, the length of an entry does not affect the ordering.
# For question 3 - I have assumed that the first name is everything before the first space in the full name (taken with .split()[0]), and I only write the first names of people who do not have "P.O." anywhere in their address. I have also assumed that duplicates are allowed, so if more than one person has 'Mason' as a first name, 'Mason' appears more than once in the csv output. As in question 1, the length of an entry does not affect the ordering.
# For question 4 - I have assumed that the first 5 full names are taken after the name/country pairs have been sorted alphabetically by country.
# For question 5 - I have assumed that sorting numerically ascending means ordering the phone numbers from the smallest number to the largest, so a number with fewer digits comes before a number with more digits, and numbers with the same count of digits are ordered from lowest to highest. The phone number (the second item of each name/phone pair) is the sort key.

# YOU MAY USE ANY NUMBER OF CELLS AS YOU NEED
# YOUR CODE HERE
import json
import csv
import pickle
import re

name_data = []
phone_data = []
address_data = []
city_data = []
country_data = []
email_data = []

with open("data.csv", encoding="utf-8", newline="") as openfile:
    # DictReader uses the header row as keys, so each row is a dict of column name -> value
    csvReader = csv.DictReader(openfile)
    for row in csvReader:
        # Append each field to the list that holds its column
        name_data.append(row['Name'])
        phone_data.append(row['Phone'])
        address_data.append(row['Address'])
        city_data.append(row['City'])
        country_data.append(row['Country'])
        email_data.append(row['Email'])

# data.json and data.pkl are read with the same layout: {column name: {row index: value}}.
# Mapping each column name to its list lets both files go through the same loop.
columns = {"Name": name_data, "Phone": phone_data, "Address": address_data,
           "City": city_data, "Country": country_data, "Email": email_data}

with open("data.json", "rb") as openfile:
    json_decoded = json.load(openfile)
for key, value in json_decoded.items():
    if key in columns:
        columns[key].extend(value.values())

with open("data.pkl", "rb") as openfile:
    pkl_file = pickle.load(openfile)
for key, value in pkl_file.items():
    if key in columns:
        columns[key].extend(value.values())



full_data = {"Name": name_data, "Phone": phone_data, "Address": address_data, "City": city_data, "Country": country_data, "Email": email_data}

#print(full_data)

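#Question 1
# The with block closes the file so every row is flushed to disk
with open("question_1.csv", "w", newline='', encoding="utf-8") as outfile:
    unique_countries = csv.writer(outfile)
    # set() removes duplicates; sorted() orders the names alphabetically
    for country in sorted(set(full_data['Country'])):
        unique_countries.writerow([country])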

#Question 2
# Dots are escaped so "." matches only a literal period; fullmatch checks the whole address
email_pattern = re.compile(r"[A-Za-z0-9_-]+(\.[A-Za-z0-9_-]+)*@(?P<domain>[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,})")
unique_domains = set()
for email in full_data['Email']:
    match = email_pattern.fullmatch(email.strip())
    if match:
        # The question asks for the complete domain: everything after the "@"
        unique_domains.add(match.group('domain'))
with open("question_2.csv", "w", newline='', encoding="utf-8") as outfile:
    unique_emails = csv.writer(outfile)
    for domain in sorted(unique_domains):
        unique_emails.writerow([domain])

#Question 3
# Pair names with addresses directly (not through a dict) so people who share a full name are all kept
first_names = sorted(
    name.split()[0]
    for name, address in zip(name_data, address_data)
    if 'P.O.' not in address
)
with open("question_3.csv", "w", newline='', encoding="utf-8") as outfile:
    firstnames_without_pobox = csv.writer(outfile)
    for first_name in first_names:
        firstnames_without_pobox.writerow([first_name])

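#Question 4
# A list of pairs keeps people who share a full name; a dict would silently drop them
name_country = list(zip(name_data, country_data))
# sorted() is stable, so people from the same country keep their original order
sorted_namecountry = sorted(name_country, key=lambda x: x[1])
with open("question_4.csv", "w", newline='', encoding="utf-8") as outfile:
    firstfivepeople_bycountry = csv.writer(outfile)
    for name, country in sorted_namecountry[:5]:
        firstfivepeople_bycountry.writerow([name])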

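#Question 5
def phone_as_number(phone):
    # Assumption: compare phone numbers by their digits only, read as one integer
    digits = re.sub(r"\D", "", str(phone))
    return int(digits) if digits else 0
# A list of pairs keeps people who share a full name; sorted() is ascending by default
name_phone = list(zip(name_data, phone_data))
sorted_namephone = sorted(name_phone, key=lambda x: phone_as_number(x[1]))
with open("question_5.csv", "w", newline='', encoding="utf-8") as outfile:
    firstfivepeople_byphone = csv.writer(outfile)
    for name, phone in sorted_namephone[:5]:
        firstfivepeople_byphone.writerow([name])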

In [None]:
# Autograde cell - do not erase/delete

In [None]:
# Autograde cell - do not erase/delete

In [None]:
# Autograde cell - do not erase/delete

In [None]:
# Autograde cell - do not erase/delete

In [None]:
# Autograde cell - do not erase/delete

In [None]:
# Autograde cell - do not erase/delete

In [None]:
# Autograde cell - do not erase/delete