### Questions to Investigate

Can you identify any limitations or distortions of the data?

What is the most popular name of all time? (Of either gender.)

What is the most gender ambiguous name in 2013? 1945?

Of the names represented in the data, find the name that has had the largest percentage increase in popularity since 1980. Largest decrease?

Can you identify names that may have had an even larger increase or decrease in popularity?

#### import libraries

In [3]:
import pandas as pd
import numpy as np
import scipy as sp
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.cm as cmx
import matplotlib.colors as colors
from mpl_toolkits.mplot3d import Axes3D
%matplotlib inline
from IPython.display import display

#### load data

In [4]:
national_names = pd.read_csv('Datasets/NationalNames.csv', delimiter = ',', usecols = [1, 2, 3, 4])
state_names = pd.read_csv('Datasets/StateNames.csv', delimiter = ',', usecols = [1, 2, 3, 4, 5])
display(national_names.head())
display(state_names.head())

| | Name | Year | Gender | Count |
|---|---|---|---|---|
| 0 | Mary | 1880 | F | 7065 |
| 1 | Anna | 1880 | F | 2604 |
| 2 | Emma | 1880 | F | 2003 |
| 3 | Elizabeth | 1880 | F | 1939 |
| 4 | Minnie | 1880 | F | 1746 |


| | Name | Year | Gender | State | Count |
|---|---|---|---|---|---|
| 0 | Mary | 1910 | F | AK | 14 |
| 1 | Annie | 1910 | F | AK | 12 |
| 2 | Anna | 1910 | F | AK | 10 |
| 3 | Margaret | 1910 | F | AK | 8 |
| 4 | Helen | 1910 | F | AK | 7 |


In [5]:
# last year in national names data set (taken from the last row)
print("Last year: ", national_names.iloc[-1]['Year'])

# number of rows in data frame
print("Num rows: ", len(national_names.index))

Last year:  2014
Num rows:  1825433


### Question 1: Can you identify any limitations or distortions of the data?

The years included in the national data set and the state data set do not match up. The national data begins in 1880 while the state data begins in 1910. This limits the comparisons we can make between state and national data.

We do not have totals for each name, since the data is broken up by year. We have to calculate them separately by summing each name's counts across all years.

The number of names differs from year to year and from state to state, because a name is included only if it has at least five occurrences. Thus we cannot assume that any two years contain the same set of names, and names with fewer than five occurrences are missing entirely.

The data is not broken up into smaller components that could make analysis easier. We have to work with each data frame in its entirety, which is slow because the frames are so large (the national data alone has 1,825,433 rows).
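The limitations above can be checked directly. The following is a minimal sketch, assuming the column names shown in the `head()` output; `summarize`, `national_sample`, and `state_sample` are hypothetical names, and the sample rows are small illustrative stand-ins rather than the real files.

```python
import pandas as pd

# Tiny stand-ins for the two data frames (illustrative values, not the real data)
national_sample = pd.DataFrame({
    'Name':   ['Mary', 'Anna', 'Mary', 'John', 'Emma'],
    'Year':   [1880, 1880, 1881, 1881, 1881],
    'Gender': ['F', 'F', 'F', 'M', 'F'],
    'Count':  [7065, 2604, 6919, 8769, 2034],
})
state_sample = pd.DataFrame({
    'Name':   ['Mary', 'Annie', 'Mary'],
    'Year':   [1910, 1910, 1911],
    'Gender': ['F', 'F', 'F'],
    'State':  ['AK', 'AK', 'AK'],
    'Count':  [14, 12, 12],
})

def summarize(df):
    """Return the year range, the smallest count, and the number of rows per year."""
    return {
        'first_year': int(df['Year'].min()),
        'last_year': int(df['Year'].max()),
        'min_count': int(df['Count'].min()),
        'rows_per_year': df.groupby('Year').size().to_dict(),
    }

national_summary = summarize(national_sample)
state_summary = summarize(state_sample)
```

Calling `summarize(national_names)` and `summarize(state_names)` on the real frames should expose the mismatched starting years, the minimum count, and the varying number of rows per year.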

### Question 2: What is the most popular name of all time? (Of either gender.)

In [6]:
# build dictionary of names, values are total counts
names_dict = {}
names = national_names.iloc[:,0].values
counts = national_names.iloc[:,-1].values
for i in range(len(national_names.index)):
    # first time we see this name: start its count
    if names[i] not in names_dict:
        names_dict[names[i]] = counts[i]
    # otherwise increase its count
    else:
        names_dict[names[i]] += counts[i]

# now do some processing on dict to get most pop name
names_list = list(names_dict.items())
print("Names (first 10): ", names_list[:10])
# sort that list by count, ascending, so the most popular names come last
names_list = sorted(names_list, key=lambda x: x[1])
print("Names (first 10 sorted): ", names_list[-10:])

Names (first 10):  [('Dejamarie', 6), ('Annabellah', 5), ('Charelle', 429), ('Dago', 94), ('Jhase', 179), ('Derika', 440), ('Katavia', 33), ('Derike', 6), ('Zakharia', 15), ('Jazzmon', 106)]
Names (first 10 sorted):  [('Charles', 2376700), ('Richard', 2564867), ('Joseph', 2580687), ('David', 3590557), ('William', 4071368), ('Mary', 4130441), ('Michael', 4330805), ('Robert', 4816785), ('John', 5106590), ('James', 5129096)]


So our winner is **James** at over five million (5,129,096 to be exact), narrowly ahead of John (5,106,590).
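The same totals can be computed without an explicit Python loop. This is a minimal sketch using a pandas `groupby`, assuming the `Name` and `Count` columns shown earlier; `total_counts` and `sample` are hypothetical names and the sample values are made up for illustration.

```python
import pandas as pd

# Small made-up sample with the same columns as the national data
sample = pd.DataFrame({
    'Name':   ['James', 'John', 'James', 'Mary', 'John'],
    'Year':   [1950, 1950, 1951, 1951, 1951],
    'Gender': ['M', 'M', 'M', 'F', 'M'],
    'Count':  [10, 8, 7, 9, 4],
})

def total_counts(df):
    """Sum counts per name across all years and both genders, largest first."""
    return df.groupby('Name')['Count'].sum().sort_values(ascending=False)

totals = total_counts(sample)
most_popular = totals.index[0]
```

On the full data, `total_counts(national_names).head(10)` should list the same ten names as the dictionary approach, since both sum every row for a name regardless of gender.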

### Question 3: What is the most gender ambiguous name in 2013? 1945?

In [7]:
# only look at data for 2013 and 1945
data_2013 = national_names[national_names['Year'] == 2013]
data_1945 = national_names[national_names['Year'] == 1945]

#### 2013 Data

In [8]:
# split the 2013 data by gender
female_2013 = data_2013[data_2013['Gender'] == 'F']
male_2013 = data_2013[data_2013['Gender'] == 'M']

In [9]:
# gather the unique names recorded for each gender in 2013
female_names_2013 = female_2013['Name'].unique().tolist()
male_names_2013 = male_2013['Name'].unique().tolist()

#### 1945 Data

In [50]:
# gather the unique names in the 1945 data
# (a name recorded for both genders appears only once)
names_1945 = data_1945['Name'].unique().tolist()
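The notebook does not define "gender ambiguous", so the following is only one possible reading: for each name in a given year, take the smaller of the female and male counts divided by their sum. A score of 0.5 means an even split and 0 means the name was recorded for one gender only. `gender_balance` and `sample` are hypothetical names, and the sample counts are invented for illustration.

```python
import pandas as pd

# Invented counts for illustration only
sample = pd.DataFrame({
    'Name':   ['Charlie', 'Charlie', 'Riley', 'Riley', 'Emma',
               'Leslie', 'Leslie', 'Jackie', 'Jackie'],
    'Year':   [2013, 2013, 2013, 2013, 2013, 1945, 1945, 1945, 1945],
    'Gender': ['F', 'M', 'F', 'M', 'F', 'F', 'M', 'F', 'M'],
    'Count':  [100, 100, 300, 100, 500, 60, 40, 30, 10],
})

def gender_balance(df, year):
    """Rank the names of one year by how evenly they split between F and M."""
    one_year = df[df['Year'] == year]
    # one row per name, one column per gender, 0 where a gender is missing
    by_gender = one_year.pivot_table(index='Name', columns='Gender',
                                     values='Count', aggfunc='sum', fill_value=0)
    by_gender = by_gender.reindex(columns=['F', 'M'], fill_value=0)
    total = by_gender['F'] + by_gender['M']
    # share of the minority gender: 0.5 is an even split
    balance = by_gender[['F', 'M']].min(axis=1) / total
    result = pd.DataFrame({'Total': total, 'Balance': balance})
    # most balanced first, larger totals breaking ties
    return result.sort_values(['Balance', 'Total'], ascending=False)

ranking_2013 = gender_balance(sample, 2013)
ranking_1945 = gender_balance(sample, 1945)
```

Breaking ties by total count is a design choice: without it, a name with five girls and five boys would rank alongside a far more common name with the same even split.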

### Question 4: Of the names represented in the data, find the name that has had the largest percentage increase in popularity since 1980. Largest decrease?

In [48]:
# look at data for 1980 forward
data_1980 = national_names[national_names['Year'] >= 1980]

In [55]:
# gather the unique names from 1980 forward
names_1980 = data_1980['Name'].unique().tolist()
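One way to approach the percentage change is to compare each name's count in 1980 with its count in the last year of the data. The sketch below assumes that raw counts (summed over both genders) stand in for popularity, and it keeps only names present in both years; `percent_change` and `sample` are hypothetical names and the sample counts are invented.

```python
import pandas as pd

# Invented counts for illustration only
sample = pd.DataFrame({
    'Name':   ['Aria', 'Aria', 'Jennifer', 'Jennifer', 'Michael', 'Michael', 'Nova'],
    'Year':   [1980, 2014, 1980, 2014, 1980, 2014, 2014],
    'Gender': ['F', 'F', 'F', 'F', 'M', 'M', 'F'],
    'Count':  [10, 510, 1000, 50, 200, 100, 40],
})

def percent_change(df, start_year, end_year):
    """Percentage change in count between two years, for names present in both."""
    # rows are names, columns are years, NaN where a name is absent in a year
    totals = df.groupby(['Name', 'Year'])['Count'].sum().unstack('Year')
    both = totals[[start_year, end_year]].dropna()
    change = (both[end_year] - both[start_year]) / both[start_year] * 100.0
    # ascending: largest decrease first, largest increase last
    return change.sort_values()

change = percent_change(sample, 1980, 2014)
largest_decrease = change.index[0]
largest_increase = change.index[-1]
```

Dividing each count by the total number of recorded births in its year would measure share rather than raw count, which may be a fairer notion of popularity when the number of births differs between the two years.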

### Question 5: Can you identify names that may have had an even larger increase or decrease in popularity?
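Because a name is recorded only when it has at least five occurrences, a name that is missing in 1980 but present later (or the reverse) has no defined percentage change, and its true change could be larger than anything measured among names present in both years. A minimal sketch for listing such candidates follows; `appeared_and_vanished` and `sample` are hypothetical names and the sample rows are invented.

```python
import pandas as pd

# Invented rows for illustration only
sample = pd.DataFrame({
    'Name':   ['Jennifer', 'Michael', 'Michael', 'Nova'],
    'Year':   [1980, 1980, 2014, 2014],
    'Gender': ['F', 'M', 'M', 'F'],
    'Count':  [1000, 200, 100, 40],
})

def appeared_and_vanished(df, start_year, end_year):
    """Return (names only in end_year, names only in start_year), each sorted."""
    start_names = set(df.loc[df['Year'] == start_year, 'Name'])
    end_names = set(df.loc[df['Year'] == end_year, 'Name'])
    return sorted(end_names - start_names), sorted(start_names - end_names)

appeared, vanished = appeared_and_vanished(sample, 1980, 2014)
```

A name in `appeared` had fewer than five recorded occurrences in the start year, possibly none, so its count in the end year gives only a lower bound on how much it grew.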