# Data Science with `Python` Practice

This is your first practice notebook. The purpose of these practices is to reinforce some of the content you covered in the lab and to introduce some new material, with a bit of guidance along the way. Unlike the labs, these notebooks are incomplete: you will actively edit and write code to modify or produce output. The skeleton is already here, but throughout these practice notebooks we will ask you to fill in the rest of the code. In doing so, you will hone your data science techniques and learn how to search for solutions to your programming hurdles.

Today we will be going over the fundamentals of data science with Python. Much of the content will be similar to your lab, [Introduction to Data Science with Python](../labs/intro_data_science_python.ipynb), so it will serve as a good guidepost for answering some of the questions. We'll begin today by reading in the data...

## Read in Data 
For this practice we will be using a different baby names dataset.

We want to read in the data without using any libraries.

In [4]:
with open('../../../datasets/baby-names/NationalNames2.csv', 'r') as file:
    data = file.read()
    print(repr(data[0:101]))


FileNotFoundError: [Errno 2] No such file or directory: '../../../datasets/baby-names/NationalNames2.csv'

Currently, we can only use indexes to locate specific characters within the data, which includes unwanted characters such as commas and newline characters. In other words, all of the data are stored in a single string, which is not very useful.

**Activity 1**: *Read in the file so that it is a list of lists. In other words, I should be able to access each row individually as well as individual values within the row.* 

In [5]:
# Code for activity 1 goes here
# -----------------------------

with open('../../../datasets/baby-names/NationalNames1.csv', 'r') as file:
    data = file.read()

    data_lists = data.split("\n")

    list_of_lists = []
    for line in data_lists:
        row = line.split(',')
        list_of_lists.append(row)

print(list_of_lists[0:10])

FileNotFoundError: [Errno 2] No such file or directory: '../../../datasets/baby-names/NationalNames1.csv'
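
Because the CSV file may not be available where you run this notebook, the same splitting approach can be tried on a small in-memory string. The sketch below is only an illustration: the rows are made up, and the column order (`Id`, `Name`, `Year`, `Gender`, `Count`) is an assumption based on how the columns are indexed elsewhere in this notebook. It uses `str.splitlines()` in place of `split("\n")`, which avoids a trailing empty row when the text ends with a newline.

```python
# Made-up sample rows; the column order is an assumption, not read from the real file.
sample = "Id,Name,Year,Gender,Count\n1,Alice,1900,F,12\n2,Bob,1900,M,45\n"

# splitlines() drops the final newline, so no empty row appears at the end.
rows = [line.split(',') for line in sample.splitlines()]

print(rows[0])     # the header row
print(rows[1][1])  # the name in the first data row

# By contrast, split("\n") leaves an empty string after the last newline.
leftover = sample.split("\n")[-1]
print(repr(leftover))
```

Each inner list is one row, so `rows[1]` is a whole row and `rows[1][1]` is a single value within it.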

To make sure everyone is working with the data loaded the same way, we are going to read in the data using the `csv` library. Remember that there is a lot of data to work with here, so we are also going to subset it.

In [6]:
import csv

# create a list of lists with the csv library and store the data in a `data_list` variable
with open('../../../datasets/baby-names/NationalNames2.csv', newline='') as file:
    data_list = list(csv.reader(file, delimiter=','))

# create a subset of the entire data set to speed things up
subset = data_list[1:301]

FileNotFoundError: [Errno 2] No such file or directory: '../../../datasets/baby-names/NationalNames2.csv'
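
To see what `csv.reader` produces without needing the file on disk, here is a minimal sketch that reads from an in-memory buffer. The rows are made up for illustration and the column order is an assumption. It shows why the subset starts at index 1: index 0 holds the header row.

```python
import csv
import io

# Made-up rows standing in for the real file; the column order is an assumption.
sample = io.StringIO(
    "Id,Name,Year,Gender,Count\n"
    "1,Alice,1900,F,12\n"
    "2,Bob,1900,M,45\n"
    "3,Cara,1901,F,8\n"
)

data_list = list(csv.reader(sample, delimiter=','))

header = data_list[0]    # the first row holds the column names
subset = data_list[1:3]  # slicing from index 1 skips the header

print(header)
print(subset)
```

Note that every value, including the numeric ones, comes back as a string.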

Take the following scenario:

Imagine that we want to find the names in the data set that are not very common. Let's classify names with a `Count` less than 30 as not very popular. The code below is almost what we need to find all rows with a count less than 30, but it raises an error.

**Activity 2**: *In the second cell below, correct (debug) the following code and answer the questions.*

In [7]:
for row in subset:
    if int(row[4]) < 30:
        print(row[1])

NameError: name 'subset' is not defined

**Questions**:
1. What does the error above mean?
2. How would you correct it so that the names given to fewer than 30 people are `print`ed out?

In [8]:
# Code for activity 2 goes here
# -----------------------------
# 1. Answer the question here as a comment
# The Count column is read in as a str, so it must be cast to an int before it is compared to 30.
# 2. Below put the corrected code
for row in subset:
    if int(row[4]) < 30:
        print(row[1])


NameError: name 'subset' is not defined
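
For reference, here is a small sketch of the kind of error that arises when a `Count` value is compared without casting it first. The rows are made up and the column order is an assumption; the point is only that values read from a CSV file are strings, and Python 3 refuses to compare a `str` with an `int`.

```python
# Made-up rows in the assumed column order Id, Name, Year, Gender, Count.
subset = [
    ['1', 'Alice', '1900', 'F', '12'],
    ['2', 'Bob', '1900', 'M', '45'],
]

# Comparing the raw string with an integer raises a TypeError in Python 3.
try:
    subset[0][4] < 30
except TypeError as err:
    print("TypeError:", err)

# Casting the count to int first makes the comparison valid.
uncommon = [row[1] for row in subset if int(row[4]) < 30]
print(uncommon)
```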

## Data Manipulation with `pandas`

We are going to transition to using `pandas` now. Let's begin by reading in the file...

In [9]:
import pandas as pd

df = pd.read_csv('../../../datasets/baby-names/NationalNames2.csv')

FileNotFoundError: [Errno 2] File b'../../../datasets/baby-names/NationalNames2.csv' does not exist: b'../../../datasets/baby-names/NationalNames2.csv'

In [None]:
#displaying the first 10 rows
df.head(10)

So this looks good, but the `Id` column from the original file is redundant because `pandas` already provides our data frame with an index.

**Activity 3**: *Remove the `Id` column upon reading in the data.*

In [10]:
# Code for activity 3 goes here
# -----------------------------
with open('../../../datasets/baby-names/NationalNames2.csv', 'r') as file:
    df = pd.read_csv(file)
    df.drop(columns='Id', inplace=True)
    
df.head()

FileNotFoundError: [Errno 2] No such file or directory: '../../../datasets/baby-names/NationalNames2.csv'
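
Another possible approach is to skip the column while the file is being parsed, so it never enters the data frame at all. The sketch below assumes made-up sample rows and the column names used in this notebook, and relies on the `usecols` parameter of `pd.read_csv`, which accepts a callable that is evaluated against each column name.

```python
import io
import pandas as pd

# Made-up rows standing in for the real file.
sample = io.StringIO(
    "Id,Name,Year,Gender,Count\n"
    "1,Alice,1900,F,12\n"
    "2,Bob,1900,M,45\n"
)

# Keep every column whose name is not 'Id' while parsing.
df_sample = pd.read_csv(sample, usecols=lambda col: col != 'Id')

print(df_sample.columns.tolist())
```

Dropping the column after reading and excluding it while reading give the same data frame; the second simply avoids loading data you do not need.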

We now want to subset the data frame to only display rows for female names. Remember, here is how we do that in `pandas`. 

In [11]:
females = df[df['Gender'] == 'F']

NameError: name 'df' is not defined

Remember though, we are trying to find names that are not very common.

**Activity 4**: *From this subset of female names, return a data frame with the names whose count is less than 30. Name this data frame `uncommon_f`.*

In [12]:
# Code for activity 4 goes here 
# -----------------------------

uncommon_f = females[females['Count'] < 30]

uncommon_f.head()


NameError: name 'females' is not defined

Now let's do something similar for male names, but this time we should include both uncommon and very common names in our subset.

**Activity 5**: *Create a data frame of male names whose count is less than 30 or greater than or equal to 1000. Name this data frame `com_uncom_m`.*

In [None]:
# Code for activity 5 goes here 
# -----------------------------

males = df[df['Gender'] == 'M']
com_uncom_m = males[(males['Count'] < 30) | (males['Count'] >= 1000)]

com_uncom_m.head()
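
To check the boundary behavior of an "or" filter like this one, it can help to try it on a tiny data frame with values chosen right at the edges. The names and counts below are made up for illustration. Each condition is wrapped in parentheses because `|` binds more tightly than the comparison operators in Python.

```python
import pandas as pd

# Made-up counts chosen to sit on and around the two thresholds.
males_sample = pd.DataFrame({
    'Name': ['Bob', 'Carl', 'Dave', 'Eli'],
    'Count': [12, 30, 999, 1000],
})

# True where the count is below 30 OR at least 1000.
mask = (males_sample['Count'] < 30) | (males_sample['Count'] >= 1000)
selected = males_sample[mask]

print(selected['Name'].tolist())
```

A count of exactly 30 is excluded by `< 30`, while a count of exactly 1000 is included by `>= 1000`.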

We are going to go ahead and do some sorting now. Remember this bit of code from the lab exercises where we sorted the rows by `Count`.

In [None]:
df.sort_values(by = ['Count'], ascending = True).head(10)

**Activity 6**: *Now sort the data frame, `df`, by `Year` and alphabetically by `Name`.* 

In [None]:
# Code for activity 6 goes here 
# -----------------------------

df.sort_values(by = ['Year','Name'], ascending = True)
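
When sorting by several columns, `ascending` can also be given as a list with one entry per column. The sketch below uses made-up rows to sort by `Year` from newest to oldest while keeping `Name` alphabetical within each year. Note that `sort_values` returns a new data frame and leaves the original unchanged unless the result is assigned.

```python
import pandas as pd

# Made-up rows for illustration.
df_sample = pd.DataFrame({
    'Name': ['Cara', 'Alice', 'Bob', 'Alice'],
    'Year': [1901, 1900, 1900, 1901],
})

# Year descending, then Name ascending within each year.
sorted_sample = df_sample.sort_values(by=['Year', 'Name'], ascending=[False, True])

print(sorted_sample['Name'].tolist())
print(sorted_sample['Year'].tolist())
```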

Below is one way to find the most popular name in the entire data set, as measured by its raw `Count`.

In [None]:
df.sort_values(by = ['Count'], ascending = True).tail(1)

But what if we were interested in something a bit more specific? Perhaps, the most popular name during a given year.

**Activity 7**: *Find the most popular female name in the year 1881.*

In [None]:
# Code for activity 7 goes here 
# -----------------------------

females[females['Year']==1881].sort_values(by = ['Count'], ascending = False).head(1)
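
Sorting and taking the first row works, but it is not the only option. As a sketch on made-up data, the example below filters to one year and then finds the top row with `idxmax` and, alternatively, with `nlargest`. The names, years, and counts are illustrative only.

```python
import pandas as pd

# Made-up rows for illustration.
females_sample = pd.DataFrame({
    'Name': ['Alice', 'Cara', 'Dana'],
    'Year': [1900, 1900, 1901],
    'Gender': ['F', 'F', 'F'],
    'Count': [12, 40, 25],
})

in_1900 = females_sample[females_sample['Year'] == 1900]

# idxmax returns the index label of the largest Count; loc fetches that row.
top_row = in_1900.loc[in_1900['Count'].idxmax()]
print(top_row['Name'])

# nlargest(1, 'Count') returns the same row as a one-row data frame.
top_df = in_1900.nlargest(1, 'Count')
print(top_df['Name'].tolist())
```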



This final practice exercise is going to be a challenge. Challenge exercises are meant to encourage you to expand on what you have already learned and to search for answers that we may not have explicitly gone over.

Imagine that we wanted to find only the names starting with a certain letter.

**Activity 8**: *Create a subset of names from the data set that start with the letter "E". Name this data frame `starts_with_e`.*

In [None]:
# Code for activity 8 goes here 
# -----------------------------

starts_with_e = df[df['Name'].str.startswith('E')]

starts_with_e.head()
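
One detail worth knowing is that `str.startswith` is case-sensitive. The sketch below, using made-up names, compares a plain match on `'E'` with a match that first converts each name to upper case; whether the real data set contains lowercase names is not something this notebook establishes, so treat the second variant as optional.

```python
import pandas as pd

# Made-up names, including one in lower case.
df_sample = pd.DataFrame({'Name': ['Eli', 'emma', 'Bob', 'Eve']})

# Case-sensitive: only names beginning with a capital 'E' match.
starts_upper = df_sample[df_sample['Name'].str.startswith('E')]

# Case-insensitive: upper-case each name before testing the first letter.
starts_any_case = df_sample[df_sample['Name'].str.upper().str.startswith('E')]

print(starts_upper['Name'].tolist())
print(starts_any_case['Name'].tolist())
```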

# Save your notebook, then `File > Close and Halt`