# Numpy

Before we start working with numpy, let's remember what we used to do without it. Create two lists: one for names and another for those people's scores in a given class:

In [1]:
names = ["Charlotte", "Ingrid", "Ian", "Eric"]
scores = [80, 95, 85, 70]

Now let's say that we want to divide each of those scores by two and assign the results to another variable. Let's first do it without numpy and assign the results to a variable called ```half```:

In [2]:
# Build the halved scores one element at a time
half = []
for score in scores:
    half.append(score / 2.0)

In [3]:
half

[40.0, 47.5, 42.5, 35.0]

We can put the data into an array structure that lets us apply more powerful functions. The data structure that we're interested in is called an ```ndarray``` and comes from the ```numpy``` package:

In [4]:
import numpy as np
ascores = np.array(scores)

In [5]:
ascores

array([80, 95, 85, 70])

In [6]:
ahalf = ascores / 2

In [7]:
ahalf

array([40. , 47.5, 42.5, 35. ])
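The division above works without a loop because numpy applies arithmetic to every element at once. The same holds for other operators and for comparisons. The following is a minimal sketch; the names `curved`, `passed`, and `average` are illustrative and not part of the original notebook:

```python
import numpy as np

scores = [80, 95, 85, 70]
ascores = np.array(scores)

# Arithmetic is applied element by element, with no explicit loop.
curved = ascores + 5

# Comparisons are vectorized too and produce a boolean array.
passed = ascores >= 80

# Reductions summarise the whole array in one call.
average = ascores.mean()

print(curved)
print(passed)
print(average)
```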

Numpy arrays are powerful, but they have some limitations. For example, every element of an array must have the same data type (e.g. int). pandas provides two additional data structures that are built on numpy ndarrays: Series and DataFrames.
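To see the single-type limitation in practice, the sketch below builds arrays from mixed inputs and inspects the resulting `dtype`. The variable names are illustrative. The exact integer and string dtypes can vary by platform, so the comments describe only the kind of type to expect:

```python
import numpy as np

ints = np.array([80, 95, 85, 70])
mixed_numbers = np.array([80, 95.5, 85, 70])
mixed_text = np.array([80, "Ian"])

# All integers: numpy picks an integer dtype.
print(ints.dtype)

# One float is enough to convert every element to float.
print(mixed_numbers.dtype)

# Mixing a number with text converts everything to strings.
print(mixed_text.dtype)
print(mixed_text)
```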

# Series

Let's create a simple pandas Series and examine it:

In [8]:
import pandas as pd
sscores = pd.Series(scores,name='scores')

In [9]:
sscores

0    80
1    95
2    85
3    70
Name: scores, dtype: int64

So you see a couple of useful things: an index (0 to 3) and a data type (dtype), which in this case is an int64.

**A Series is a one-dimensional ndarray with axis labels**

In [10]:
data = dict(zip(names,scores))

In [11]:
data

{'Charlotte': 80, 'Ingrid': 95, 'Ian': 85, 'Eric': 70}

In [12]:
sData = pd.Series(data=data,name='score')

In [13]:
sData

Charlotte    80
Ingrid       95
Ian          85
Eric         70
Name: score, dtype: int64
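With names as the index, values can be looked up by label as well as by position, and arithmetic keeps the labels attached. A small sketch, assuming the same `names` and `scores` lists as above (the result names are illustrative):

```python
import pandas as pd

names = ["Charlotte", "Ingrid", "Ian", "Eric"]
scores = [80, 95, 85, 70]
sData = pd.Series(dict(zip(names, scores)), name="score")

# Look up a value by its label.
by_label = sData["Ingrid"]

# Look up a value by its integer position.
by_position = sData.iloc[1]

# Filtering and arithmetic both preserve the index labels.
high = sData[sData > 80]
half = sData / 2

print(by_label, by_position)
print(high)
print(half)
```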

So Series are a bit friendlier than numpy arrays, but they're still only one-dimensional. Keep in mind that our basic data abstraction is a table, which can be thought of as a two-dimensional array.

# DataFrame

Let's go ahead and create a simple DataFrame with just one column:

In [14]:
pd.DataFrame(scores,index=names,columns=['score'])

|  | score |
|---|---|
| Charlotte | 80 |
| Ingrid | 95 |
| Ian | 85 |
| Eric | 70 |
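A DataFrame becomes more useful once it has several columns. One common way to build one is from a dictionary that maps column names to lists of values. In this sketch the second column, `half`, is derived from the scores purely for illustration:

```python
import pandas as pd

names = ["Charlotte", "Ingrid", "Ian", "Eric"]
scores = [80, 95, 85, 70]

# Each dictionary key becomes a column; the index supplies the row labels.
table = pd.DataFrame(
    {"score": scores, "half": [s / 2 for s in scores]},
    index=names,
)

print(table)
print(table.shape)  # (rows, columns)
```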


## Working with real data

We can create DataFrames by reading a file as well. This allows us to work with large-scale real data:

In [15]:
df = pd.read_csv('names.csv')
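If `names.csv` is not at hand, `read_csv` can be tried on an in-memory string instead. The sketch below assumes the file has a header row with the four columns used in this notebook, and uses the first five rows shown in this notebook as sample data. The real file's layout may differ:

```python
import io

import pandas as pd

# Assumed layout: a header row followed by comma-separated values.
csv_text = """name,gender,birth_count,year
Simeon,M,23,1880
Raoul,M,7,1880
Lou,M,14,1880
Myra,F,83,1880
Alois,M,10,1880
"""

# read_csv accepts any file-like object, not only a path.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample)
print(sample.dtypes)  # column types are inferred from the text
```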

## Data Description

This dataset characterizes how baby name trends change over the years. Each row gives the number of births for one name, one gender, and one year. A row appears only for the years in which that name and gender were recorded.

## Metadata

**name**

**gender**

**birth_count**

**year**

In [16]:
df.head()

|  | name | gender | birth_count | year |
|---|---|---|---|---|
| 0 | Simeon | M | 23 | 1880 |
| 1 | Raoul | M | 7 | 1880 |
| 2 | Lou | M | 14 | 1880 |
| 3 | Myra | F | 83 | 1880 |
| 4 | Alois | M | 10 | 1880 |


In [17]:
df.sample(5)

|  | name | gender | birth_count | year |
|---|---|---|---|---|
| 1687087 | Matan | M | 32 | 2010 |
| 163434 | Evaleen | F | 7 | 1919 |
| 1227625 | Jeran | M | 8 | 1996 |
| 493749 | Mayetta | F | 6 | 1953 |
| 452535 | Mercy | F | 29 | 1949 |


Finally, you can get some basic information about the size and shape of the DataFrame:

In [18]:
print("The number of rows of the dataset is: ", len(df))
print("The number of columns of the dataset is: ", len(df.columns))
print("The shape of the dataset is: ", df.shape)

The number of rows of the dataset is:  1825433
The number of columns of the dataset is:  4
The shape of the dataset is:  (1825433, 4)


In [19]:
df.columns

Index(['name', 'gender', 'birth_count', 'year'], dtype='object')

And you can extract one or more columns:

In [20]:
print(df['name'])

0             Simeon
1              Raoul
2                Lou
3               Myra
4              Alois
             ...    
1825428       Suheyb
1825429        Asani
1825430     Leonitus
1825431    Josselynn
1825432       Vaiden
Name: name, Length: 1825433, dtype: object


In [21]:
name_year = df[['name','year']]
name_year.head()

|  | name | year |
|---|---|---|
| 0 | Simeon | 1880 |
| 1 | Raoul | 1880 |
| 2 | Lou | 1880 |
| 3 | Myra | 1880 |
| 4 | Alois | 1880 |
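Note the difference between single and double brackets: a single column name returns a Series, while a list of names returns a DataFrame, even when the list has one element. A minimal sketch on a five-row stand-in for `df` (the name `sample` is illustrative):

```python
import pandas as pd

sample = pd.DataFrame({
    "name": ["Simeon", "Raoul", "Lou", "Myra", "Alois"],
    "gender": ["M", "M", "M", "F", "M"],
    "birth_count": [23, 7, 14, 83, 10],
    "year": [1880, 1880, 1880, 1880, 1880],
})

# A single label returns a one-dimensional Series.
one = sample["name"]

# A list of labels returns a two-dimensional DataFrame.
two = sample[["name"]]

print(type(one).__name__, one.shape)
print(type(two).__name__, two.shape)
```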


## Extracting rows

In [22]:
df.iloc[0]

name           Simeon
gender              M
birth_count        23
year             1880
Name: 0, dtype: object

In [23]:
df.loc[0]

name           Simeon
gender              M
birth_count        23
year             1880
Name: 0, dtype: object
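Here `iloc[0]` and `loc[0]` return the same row only because the default index labels happen to match the row positions. `iloc` selects by position and `loc` selects by label, so they diverge as soon as the rows are reordered. A sketch on a five-row stand-in for `df` (the names `sample` and `by_count` are illustrative):

```python
import pandas as pd

sample = pd.DataFrame({
    "name": ["Simeon", "Raoul", "Lou", "Myra", "Alois"],
    "gender": ["M", "M", "M", "F", "M"],
    "birth_count": [23, 7, 14, 83, 10],
    "year": [1880, 1880, 1880, 1880, 1880],
})

# Sorting reorders the rows but keeps their original index labels.
by_count = sample.sort_values("birth_count", ascending=False)
print(list(by_count.index))

# iloc[0] is the first row in the current order.
print(by_count.iloc[0]["name"])

# loc[0] is the row whose label is 0, wherever it now sits.
print(by_count.loc[0]["name"])
```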

In [24]:
df_year = df.set_index('year')

In [25]:
df_year.loc[1881]

| year | name | gender | birth_count |
|---|---|---|---|
| 1881 | Morris | M | 87 |
| 1881 | Emerson | M | 20 |
| 1881 | Adrien | M | 5 |
| 1881 | Fredrick | M | 65 |
| 1881 | Obie | M | 6 |
| ... | ... | ... | ... |
| 1881 | Bina | F | 5 |
| 1881 | Tillie | F | 82 |
| 1881 | Alford | M | 17 |
| 1881 | Nola | F | 32 |


## Sorting

You can use either ```sort_values()``` or ```sort_index()```. Let's say that we want to sort this DataFrame by birth count to reveal the least popular (name, gender, year) combinations:

In [26]:
df_sorted = df.sort_values('birth_count',ascending=True)
df_sorted.head(10)

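|  | name | gender | birth_count | year |
|---|---|---|---|---|
| 1444602 | Sheva | F | 5 | 2003 |
| 506503 | Suzana | F | 5 | 1954 |
| 506490 | Albertina | F | 5 | 1954 |
| 1472767 | Lillieanna | F | 5 | 2004 |
| 506484 | Ponce | M | 5 | 1954 |
| 506483 | Sally | M | 5 | 1954 |
| 506480 | Toxie | M | 5 | 1954 |
| 1472774 | Talaia | F | 5 | 2004 |
| 1472775 | Selicia | F | 5 | 2004 |
| 1472780 | Ciel | M | 5 | 2004 |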


In [27]:
df_sorted = df.sort_values('birth_count',ascending=False)
df_sorted.head(10)

|  | name | gender | birth_count | year |
|---|---|---|---|---|
| 439300 | Linda | F | 99680 | 1947 |
| 443963 | Linda | F | 96205 | 1948 |
| 436127 | James | M | 94755 | 1947 |
| 540259 | Michael | M | 92709 | 1957 |
| 431395 | Robert | M | 91642 | 1947 |
| 457418 | Linda | F | 91010 | 1949 |
| 527360 | Michael | M | 90633 | 1956 |
| 556166 | Michael | M | 90519 | 1958 |
| 444391 | James | M | 88596 | 1948 |
| 514137 | Michael | M | 88485 | 1954 |
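The cells above use only `sort_values()` on a single column. As a complement, the sketch below shows `sort_index()` and a sort on two columns at once, using a five-row stand-in for `df` (the names `sample`, `by_name`, and `multi` are illustrative):

```python
import pandas as pd

sample = pd.DataFrame({
    "name": ["Simeon", "Raoul", "Lou", "Myra", "Alois"],
    "gender": ["M", "M", "M", "F", "M"],
    "birth_count": [23, 7, 14, 83, 10],
    "year": [1880, 1880, 1880, 1880, 1880],
})

# sort_index orders rows by their index labels (here, the names).
by_name = sample.set_index("name").sort_index()

# sort_values accepts several columns, each with its own direction:
# gender ascending first, then birth_count descending within each gender.
multi = sample.sort_values(["gender", "birth_count"], ascending=[True, False])

print(by_name)
print(multi)
```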


## Filtering using Boolean Masking

In [28]:
df['birth_count'] > 10

0           True
1          False
2           True
3           True
4          False
           ...  
1825428    False
1825429    False
1825430    False
1825431    False
1825432    False
Name: birth_count, Length: 1825433, dtype: bool

In [29]:
df[df['birth_count'] > 10]

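|  | name | gender | birth_count | year |
|---|---|---|---|---|
| 0 | Simeon | M | 23 | 1880 |
| 2 | Lou | M | 14 | 1880 |
| 3 | Myra | F | 83 | 1880 |
| 6 | Arthur | M | 1599 | 1880 |
| 7 | Vena | F | 11 | 1880 |
| ... | ... | ... | ... | ... |
| 1825421 | Donaven | M | 19 | 2014 |
| 1825423 | Kinley | F | 1610 | 2014 |
| 1825425 | Keylan | M | 36 | 2014 |
| 1825426 | Issabelle | F | 15 | 2014 |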


In [30]:
df[df['birth_count'] > 40]

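|  | name | gender | birth_count | year |
|---|---|---|---|---|
| 3 | Myra | F | 83 | 1880 |
| 6 | Arthur | M | 1599 | 1880 |
| 11 | Inez | F | 106 | 1880 |
| 13 | Sylvester | M | 89 | 1880 |
| 18 | Arch | M | 61 | 1880 |
| ... | ... | ... | ... | ... |
| 1825413 | Landree | F | 60 | 2014 |
| 1825414 | Erynn | F | 43 | 2014 |
| 1825416 | Asma | F | 78 | 2014 |
| 1825417 | Montana | F | 124 | 2014 |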


### Example: Find male names after 1990 with at least 100 births in at least one year.

Let's start by filtering: we need the gender to be male, the year to be after 1990, and the birth count to be at least 100:

In [31]:
df_filtered = df[(df['gender'] == 'M') & (df['year'] > 1990) & (df['birth_count'] >= 100)]
df_filtered.head()

|  | name | gender | birth_count | year |
|---|---|---|---|---|
| 1094165 | Alexander | M | 17635 | 1991 |
| 1094178 | Brent | M | 2194 | 1991 |
| 1094210 | Damien | M | 812 | 1991 |
| 1094269 | Mohammad | M | 324 | 1991 |
| 1094302 | Samson | M | 108 | 1991 |


In [32]:
df_filtered.name.unique()

array(['Alexander', 'Brent', 'Damien', ..., 'Thorin', 'Nova', 'Danilo'],
      dtype=object)
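The combined filter relies on the element-wise operators `&` (and), `|` (or), and `~` (not). Each comparison has to be wrapped in parentheses when written inline, because `&` and `|` bind more tightly than `==` and `>` in Python. Naming the masks first avoids the issue. A sketch on a five-row stand-in for `df` (all variable names are illustrative):

```python
import pandas as pd

sample = pd.DataFrame({
    "name": ["Simeon", "Raoul", "Lou", "Myra", "Alois"],
    "gender": ["M", "M", "M", "F", "M"],
    "birth_count": [23, 7, 14, 83, 10],
    "year": [1880, 1880, 1880, 1880, 1880],
})

# Build each boolean mask separately so no parentheses are needed later.
is_male = sample["gender"] == "M"
is_common = sample["birth_count"] > 10

both = sample[is_male & is_common]    # rows where both masks are True
either = sample[is_male | is_common]  # rows where at least one is True
not_male = sample[~is_male]           # rows where the mask is False

print(both)
print(either)
print(not_male)
```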

## Descriptive and Summary Statistics

Example: What is the average birth count per row, that is, per (name, gender, year) combination?

In [33]:
df['birth_count'].mean()

184.68792116719703

<font color="red">Remember that your data is censored: you only have a row for a name if it has at least 5 occurrences in a given year. So you are computing the average **conditional on there being at least 5 occurrences.** It is very important to understand your data to perform accurate analysis!</font>

Now, let's compute the median:

In [34]:
df['birth_count'].median()

12.0

In [35]:
df['birth_count'].describe()

count    1.825433e+06
mean     1.846879e+02
std      1.566711e+03
min      5.000000e+00
25%      7.000000e+00
50%      1.200000e+01
75%      3.200000e+01
max      9.968000e+04
Name: birth_count, dtype: float64
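The mean (about 185) is far above the median (12) because a small number of very popular names pull the mean upward. The sketch below reproduces the effect on six birth counts taken from the 1880 rows shown earlier in this notebook. It is only a toy illustration and says nothing precise about the full dataset:

```python
import pandas as pd

# Birth counts for Simeon, Raoul, Lou, Myra, Alois and Arthur (1880).
counts = pd.Series([23, 7, 14, 83, 10, 1599], name="birth_count")

mean = counts.mean()
median = counts.median()

# One very large value dominates the mean but barely moves the median.
print(mean)
print(median)
print(counts.describe())
```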

## Final example

Example: Write one line of code to total the birth counts for each male name and show the top 5 names.

In [36]:
df[df['gender'] == 'M'].groupby(['name'])[['birth_count']].sum().sort_values('birth_count',ascending=False).head()

| name | birth_count |
|---|---|
| James | 5105919 |
| John | 5084943 |
| Robert | 4796695 |
| Michael | 4309198 |
| William | 4055473 |
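The one-liner chains four steps: filter, group, aggregate, and sort. Splitting it up can make each step easier to inspect. The sketch below does so on a small, made-up table. The values in `toy` are hypothetical and are not taken from `names.csv`:

```python
import pandas as pd

# Hypothetical data: the same name can appear in several years.
toy = pd.DataFrame({
    "name": ["James", "James", "John", "Linda", "John"],
    "gender": ["M", "M", "M", "F", "M"],
    "birth_count": [10, 20, 15, 50, 5],
    "year": [2000, 2001, 2000, 2000, 2001],
})

# Step 1: keep only the male rows.
males = toy[toy["gender"] == "M"]

# Step 2: group by name and add up the counts across years.
totals = males.groupby("name")["birth_count"].sum()

# Step 3: sort from largest to smallest and keep the top entries.
top = totals.sort_values(ascending=False).head()

print(top)
```

Selecting `["birth_count"]` with single brackets gives a Series here, whereas the double brackets in the one-liner keep the result as a DataFrame.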
