### Questions to Investigate

Can you identify any limitations or distortions of the data?

What is the most popular name of all time? (Of either gender.)

What is the most gender ambiguous name in 2013? 1945?

Of the names represented in the data, find the name that has had the largest percentage increase in popularity since 1980. Largest decrease?

Can you identify names that may have had an even larger increase or decrease in popularity?

#### import libraries

In [16]:
import pandas as pd
import numpy as np
import scipy as sp
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.cm as cmx
import matplotlib.colors as colors
from mpl_toolkits.mplot3d import Axes3D
%matplotlib inline

#### load data

In [13]:
national_names = pd.read_csv('Datasets/NationalNames.csv', delimiter = ',', usecols = [1, 2, 3, 4])
state_names = pd.read_csv('Datasets/StateNames.csv', delimiter = ',', usecols = [1, 2, 3, 4, 5])

In [14]:
national_names.head()

| | Name | Year | Gender | Count |
|---|---|---|---|---|
| 0 | Mary | 1880 | F | 7065 |
| 1 | Anna | 1880 | F | 2604 |
| 2 | Emma | 1880 | F | 2003 |
| 3 | Elizabeth | 1880 | F | 1939 |
| 4 | Minnie | 1880 | F | 1746 |


In [15]:
state_names.head()

| | Name | Year | Gender | State | Count |
|---|---|---|---|---|---|
| 0 | Mary | 1910 | F | AK | 14 |
| 1 | Annie | 1910 | F | AK | 12 |
| 2 | Anna | 1910 | F | AK | 10 |
| 3 | Margaret | 1910 | F | AK | 8 |
| 4 | Helen | 1910 | F | AK | 7 |


### Question 1: Can you identify any limitations or distortions of the data?

The years included in the national data set and the state data set do not match up. The national data begins in 1880 while the state data begins in 1910. This limits the comparisons we can make between state and national data.
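The mismatch in coverage can be checked directly. The following is a minimal sketch on toy stand-ins for the two data frames: the rows, the counts, and the state data's last year are hypothetical, and `year_span` is an illustrative helper, not part of the notebook.

```python
import pandas as pd

# Toy stand-ins for the two data sets (hypothetical rows, not real counts)
national_toy = pd.DataFrame({
    'Name': ['Mary', 'Mary', 'Mary'],
    'Year': [1880, 1910, 2014],
    'Gender': ['F', 'F', 'F'],
    'Count': [10, 20, 30],
})
state_toy = pd.DataFrame({
    'Name': ['Mary', 'Mary'],
    'Year': [1910, 2014],
    'Gender': ['F', 'F'],
    'State': ['AK', 'AK'],
    'Count': [5, 6],
})

def year_span(df):
    """Return the (first, last) year present in a names data frame."""
    return int(df['Year'].min()), int(df['Year'].max())

national_span = year_span(national_toy)
state_span = year_span(state_toy)

# Years covered by both data sets: later start to earlier end
common_span = (max(national_span[0], state_span[0]),
               min(national_span[1], state_span[1]))
print(national_span, state_span, common_span)
```

Applied to the real `national_names` and `state_names` frames, the same helper would show which years are safe to use for state-versus-national comparisons.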

We do not have totals for each name in the data set, since the data is broken up by year. This means we have to calculate those totals ourselves by summing the counts of each name across all years.

There is a different number of names for each year and for each state, because a name is included in the data set only if it has at least five occurrences. Thus we cannot assume that the number of names in the data set will be the same for any two years, and names given fewer than five times are missing entirely.
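One way to see this effect is to count the distinct names recorded per year and to check the smallest count present. The sketch below uses a small toy frame with hypothetical values, shaped like `national_names`.

```python
import pandas as pd

# Toy frame with the same columns as the national data (hypothetical values)
toy = pd.DataFrame({
    'Name': ['Mary', 'Anna', 'John', 'Mary', 'John'],
    'Year': [1880, 1880, 1880, 1881, 1881],
    'Gender': ['F', 'F', 'M', 'F', 'M'],
    'Count': [7, 5, 9, 8, 6],
})

# Number of distinct names recorded in each year
names_per_year = toy.groupby('Year')['Name'].nunique()

# Smallest count anywhere in the data; with a five-occurrence
# inclusion threshold this should never fall below 5
min_count = toy['Count'].min()

print(names_per_year)
print(min_count)
```

In the toy data, Anna appears in 1880 but not in 1881, which is how a name that drops below the threshold would look in the real data.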

The data is not broken up into smaller components that could make analysis easier. The national data frame alone has 1,825,433 rows, so we have to work with the data frames in their entirety, and any row-by-row processing is slow.

### Question 2: What is the most popular name of all time? (Of either gender.)

In [37]:
# last year in national names data set
print national_names.iloc[-1]['Year']

# number of rows in data frame
print len(national_names.index)

2014
1825433


In [67]:
names = national_names['Name'].tolist()
len(names)

1825433

In [68]:
names_unique = list(set(names))
len(names_unique)

93889

In [61]:
# collect each distinct name once, in order of first appearance
# (testing membership in a list for every one of the 1.8 million rows
# is quadratic, which is why the row-by-row loop had to be interrupted)
names_national = national_names['Name'].drop_duplicates().tolist()

In [23]:
# sum the counts of each name across all years and both genders
name_totals = national_names.groupby('Name')['Count'].sum()

# name with the largest total, and that total
top_name = name_totals.idxmax()
top_count = name_totals.max()

print("Most popular name of all time: %s" % top_name)
print("Number of occurrences: %d" % top_count)



### Question 3: What is the most gender ambiguous name in 2013? 1945?

In [43]:
# only look at data for 2013 and 1945
data_2013 = national_names[national_names['Year'] == 2013]
data_1945 = national_names[national_names['Year'] == 1945]

33203


#### 2013 Data

In [54]:
# split the 2013 data by gender
female_2013 = data_2013[data_2013['Gender'] == 'F']
male_2013 = data_2013[data_2013['Gender'] == 'M']

In [46]:
# distinct female and male names recorded in 2013
female_names_2013 = data_2013[data_2013['Gender'] == 'F']['Name'].unique().tolist()
male_names_2013 = data_2013[data_2013['Gender'] == 'M']['Name'].unique().tolist()

# distinct names of either gender recorded in 2013
names_2013 = data_2013['Name'].unique().tolist()

#### 1945 Data

In [50]:
# distinct names of either gender recorded in 1945
names_1945 = data_1945['Name'].unique().tolist()
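The notebook stops before scoring the names, so here is one possible way to finish the question. This sketch assumes that "gender ambiguous" means the female share of a name is close to 50%; that definition, the helper `gender_imbalance`, and the toy counts are all assumptions made for illustration.

```python
import pandas as pd

def gender_imbalance(df, year):
    """Rank names in `year` by how far the female share is from 50%.

    0.0 means an even split; 0.5 means the name was recorded for
    one gender only. Most ambiguous names come first.
    """
    one_year = df[df['Year'] == year]
    # One row per name, one column per gender; a missing gender becomes 0
    counts = one_year.pivot_table(index='Name', columns='Gender',
                                  values='Count', aggfunc='sum',
                                  fill_value=0)
    counts = counts.reindex(columns=['F', 'M'], fill_value=0)
    total = counts['F'] + counts['M']
    imbalance = (counts['F'] / total - 0.5).abs()
    return imbalance.sort_values()

# Toy data shaped like national_names (hypothetical counts)
toy = pd.DataFrame({
    'Name': ['Charlie', 'Charlie', 'Riley', 'Riley', 'Emma', 'Emma', 'Liam'],
    'Year': [2013] * 7,
    'Gender': ['F', 'M', 'F', 'M', 'F', 'M', 'M'],
    'Count': [50, 50, 60, 40, 100, 5, 90],
})

ranked = gender_imbalance(toy, 2013)
print(ranked)
```

On the real data, a name recorded five times for each gender would score as perfectly ambiguous, so it may be worth filtering on a minimum total count before ranking. Calling the helper with `national_names` and the years 2013 and 1945 would cover both halves of the question.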

### Question 4: Of the names represented in the data, find the name that has had the largest percentage increase in popularity since 1980. Largest decrease?

In [48]:
# look at data for 1980 forward
data_1980 = national_names[national_names['Year'] >= 1980]

In [55]:
# distinct names recorded from 1980 forward
# (the row-by-row loop with list membership tests was too slow to finish)
names_1980 = data_1980['Name'].unique().tolist()
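The percentage change itself is not computed in the notebook. A minimal sketch of one way to do it follows, assuming that popularity is measured by the raw count summed over both genders and that 1980 is compared with the last year in the data, 2014. The helper `percent_change` and the toy counts are hypothetical.

```python
import pandas as pd

def percent_change(df, start_year, end_year):
    """Percentage change in total count per name between two years."""
    # Total count per name (both genders) in each of the two years
    totals = (df[df['Year'].isin([start_year, end_year])]
              .groupby(['Name', 'Year'])['Count'].sum()
              .unstack('Year'))
    # Keep only names recorded in both years; otherwise the
    # percentage change is undefined
    totals = totals.dropna()
    return (totals[end_year] - totals[start_year]) / totals[start_year] * 100.0

# Toy data shaped like national_names (hypothetical counts)
toy = pd.DataFrame({
    'Name': ['Aria', 'Aria', 'Gary', 'Gary', 'Emma', 'Emma', 'Nova'],
    'Year': [1980, 2014, 1980, 2014, 1980, 2014, 2014],
    'Gender': ['F', 'F', 'M', 'M', 'F', 'F', 'F'],
    'Count': [10, 500, 1000, 100, 200, 400, 300],
})

changes = percent_change(toy, 1980, 2014)
largest_increase = changes.idxmax()
largest_decrease = changes.idxmin()
print(changes.sort_values())
print(largest_increase, largest_decrease)
```

Names that are missing in either year, like Nova in the toy data, drop out of the comparison. Because the data set omits any name with fewer than five occurrences, such names may have changed by more than anything this calculation can report.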

### Question 5: Can you identify names that may have had an even larger increase or decrease in popularity?