# US - Baby Names

### Introduction:

We are going to use a subset of [US Baby Names](https://www.kaggle.com/kaggle/us-baby-names) from Kaggle.  
The file contains names from 2004 through 2014.


### Step 1. Import the necessary libraries

In [91]:
import numpy as np
import pandas as pd

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv). 

### Step 3. Assign it to a variable called baby_names.

In [57]:
# Reads a local copy of the CSV downloaded from the address in Step 2
baby_names = pd.read_csv('US_Baby_Names_right.csv')
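
The `Unnamed: 0` column seen in the output below usually means the CSV was saved together with its index. As a minimal sketch (using a two-row in-memory stand-in for the real file, which is an assumption for illustration), passing `index_col=0` to `pd.read_csv` turns that column into the index instead. `pd.read_csv` also accepts a URL string in place of a local path.

```python
import io
import pandas as pd

# Small in-memory stand-in for the real CSV (assumption: the file starts with
# an unnamed index column, as the 'Unnamed: 0' header in the output suggests).
sample_csv = io.StringIO(
    ",Id,Name,Year,Gender,State,Count\n"
    "11349,11350,Emma,2004,F,AK,62\n"
    "11350,11351,Madison,2004,F,AK,48\n"
)

# Default read: the unnamed first column is kept and labelled 'Unnamed: 0'.
default_read = pd.read_csv(sample_csv)

# Rewind the buffer and read again, using the first column as the index.
sample_csv.seek(0)
indexed_read = pd.read_csv(sample_csv, index_col=0)

print(default_read.columns.tolist())
print(indexed_read.columns.tolist())
```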

### Step 4. See the first 10 entries

In [7]:
baby_names[:10]

,Unnamed: 0,Id,Name,Year,Gender,State,Count
0,11349,11350,Emma,2004,F,AK,62
1,11350,11351,Madison,2004,F,AK,48
2,11351,11352,Hannah,2004,F,AK,46
3,11352,11353,Grace,2004,F,AK,44
4,11353,11354,Emily,2004,F,AK,41
5,11354,11355,Abigail,2004,F,AK,37
6,11355,11356,Olivia,2004,F,AK,33
7,11356,11357,Isabella,2004,F,AK,30
8,11357,11358,Alyssa,2004,F,AK,29
9,11358,11359,Sophia,2004,F,AK,28


### Step 5. Delete the column 'Unnamed: 0' and 'Id'

In [39]:
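# Only 'Unnamed: 0' is dropped here; 'Id' is kept, so it still appears in the outputs that follow
baby = baby_names.drop("Unnamed: 0", axis=1)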
baby.head()

,Id,Name,Year,Gender,State,Count
0,11350,Emma,2004,F,AK,62
1,11351,Madison,2004,F,AK,48
2,11352,Hannah,2004,F,AK,46
3,11353,Grace,2004,F,AK,44
4,11354,Emily,2004,F,AK,41
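
The step asks for both `Unnamed: 0` and `Id` to be removed. A minimal sketch of dropping the two columns in one call, shown on a small hypothetical frame (`toy` and `trimmed` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Hypothetical two-row frame with the same leading columns as baby_names.
toy = pd.DataFrame({
    "Unnamed: 0": [11349, 11350],
    "Id": [11350, 11351],
    "Name": ["Emma", "Madison"],
    "Count": [62, 48],
})

# drop(columns=[...]) removes several columns at once and returns a new frame;
# the original frame is left unchanged.
trimmed = toy.drop(columns=["Unnamed: 0", "Id"])

print(trimmed.columns.tolist())
```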


### Step 6. Are there more male or female names in the dataset?

In [15]:
(baby.Gender=="F").mean()

0.5498315123549408

In [23]:
baby.groupby("Gender").agg({
    'Count':'sum'
})

Gender,Count
F,16380293
M,19041199


In [25]:
baby.groupby("Gender").agg({
    'Count':'sum'
})/baby.Count.sum()*100 # returns the percentage of total births for each gender

Gender,Count
F,46.243939
M,53.756061
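
The two results above answer different questions: the first is the share of rows (name, year, state entries) that are female, while the second is the share of births. A minimal sketch of the distinction on a hypothetical three-row frame, where the two shares point in opposite directions:

```python
import pandas as pd

# Hypothetical data: two female rows with small counts, one male row with a large count.
toy = pd.DataFrame({
    "Name": ["Emma", "Olivia", "Daniel"],
    "Gender": ["F", "F", "M"],
    "Count": [10, 5, 40],
})

# Share of rows per gender (each row counts once, regardless of Count).
row_share = toy["Gender"].value_counts(normalize=True)

# Share of births per gender (rows weighted by Count).
birth_share = toy.groupby("Gender")["Count"].sum() / toy["Count"].sum()

print(row_share)
print(birth_share)
```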


### Step 7. Group the dataset by name and assign to names

In [30]:
names=baby.groupby('Name')
names.head()  # first five rows of each name group, kept in original row order

,Id,Name,Year,Gender,State,Count
0,11350,Emma,2004,F,AK,62
1,11351,Madison,2004,F,AK,48
2,11352,Hannah,2004,F,AK,46
3,11353,Grace,2004,F,AK,44
4,11354,Emily,2004,F,AK,41
...,...,...,...,...,...,...
1004923,5546650,Gryffin,2014,M,WI,5
1004950,5546677,Kroy,2014,M,WI,5
1004973,5546700,Owyn,2014,M,WI,5
1005707,5583655,Haylea,2005,F,WV,5
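
Calling `head()` on a grouped object returns rows from the original frame rather than one row per group. A minimal sketch on hypothetical data, contrasting it with an aggregation that collapses each name to a single total:

```python
import pandas as pd

# Hypothetical rows: Emma appears in two states, Daniel in one.
toy = pd.DataFrame({
    "Name": ["Emma", "Emma", "Daniel"],
    "State": ["AK", "CA", "CA"],
    "Count": [62, 30, 100],
})

names = toy.groupby("Name")

# head(1) keeps the first row of every group, with the original index labels.
first_rows = names.head(1)

# sum() on the Count column gives one value per name.
totals = names["Count"].sum()

print(first_rows)
print(totals)
```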


### Step 8. How many different names exist in the dataset?

In [38]:
baby.Name.nunique() # returns the number of unique names

17632

### Step 9. What is the name with most occurrences?

In [125]:
baby.sort_values(by=["Count"],ascending=False)

,Id,Name,Year,Gender,State,Count
107416,678594,Daniel,2004,M,CA,4167
110097,681275,Daniel,2005,M,CA,3914
115739,686917,Daniel,2007,M,CA,3865
112872,684050,Daniel,2006,M,CA,3826
107417,678595,Anthony,2004,M,CA,3805
...,...,...,...,...,...,...
470218,2627153,Gus,2005,M,MI,5
470217,2627152,Giuseppe,2005,M,MI,5
470216,2627151,Garrison,2005,M,MI,5
470215,2627150,Garett,2005,M,MI,5


In [126]:
baby.sort_values(by=["Count"],ascending=False).head()

,Id,Name,Year,Gender,State,Count
107416,678594,Daniel,2004,M,CA,4167
110097,681275,Daniel,2005,M,CA,3914
115739,686917,Daniel,2007,M,CA,3865
112872,684050,Daniel,2006,M,CA,3826
107417,678595,Anthony,2004,M,CA,3805
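
Sorting by `Count` finds the single largest (name, year, state) entry, which is not necessarily the name with the most occurrences overall. One possible reading of the question is to sum `Count` per name first. A minimal sketch on hypothetical data where the two answers differ:

```python
import pandas as pd

# Hypothetical rows: Daniel has the largest single entry,
# but Emma has the largest total across states.
toy = pd.DataFrame({
    "Name": ["Daniel", "Emma", "Emma"],
    "State": ["CA", "CA", "TX"],
    "Count": [100, 60, 70],
})

# Name on the row with the largest single Count.
top_row_name = toy.sort_values("Count", ascending=False).iloc[0]["Name"]

# Name with the largest summed Count.
totals = toy.groupby("Name")["Count"].sum()
top_name = totals.idxmax()

print(top_row_name, top_name)
```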


### Step 10. How many different names have the least occurrences?

In [121]:
baby.sort_values(by=["Count"],ascending=True)

,Id,Name,Year,Gender,State,Count
1016394,5647426,Waylon,2014,M,WY,5
638879,3570297,Sawyer,2013,F,NV,5
638878,3570296,Saniyah,2013,F,NV,5
638877,3570295,Rylan,2013,F,NV,5
638876,3570294,Remi,2013,F,NV,5
...,...,...,...,...,...,...
107417,678595,Anthony,2004,M,CA,3805
112872,684050,Daniel,2006,M,CA,3826
115739,686917,Daniel,2007,M,CA,3865
110097,681275,Daniel,2005,M,CA,3914


In [122]:
baby.sort_values(by=["Count"],ascending=True).head()

,Id,Name,Year,Gender,State,Count
1016394,5647426,Waylon,2014,M,WY,5
638879,3570297,Sawyer,2013,F,NV,5
638878,3570296,Saniyah,2013,F,NV,5
638877,3570295,Rylan,2013,F,NV,5
638876,3570294,Remi,2013,F,NV,5
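
The sorted view shows which rows have the smallest `Count`, but it does not give a number of names. A minimal sketch of two possible ways to count them, on hypothetical data: names that appear on any row with the minimum count, and names whose summed count equals the minimum total.

```python
import pandas as pd

# Hypothetical rows: Sawyer appears twice with the minimum count of 5.
toy = pd.DataFrame({
    "Name": ["Waylon", "Sawyer", "Sawyer", "Remi", "Daniel"],
    "State": ["WY", "NV", "UT", "NV", "CA"],
    "Count": [5, 5, 5, 7, 100],
})

# Interpretation 1: distinct names on rows whose Count equals the row-level minimum.
min_count = toy["Count"].min()
names_in_min_rows = toy.loc[toy["Count"] == min_count, "Name"].nunique()

# Interpretation 2: names whose total across all rows equals the smallest total.
totals = toy.groupby("Name")["Count"].sum()
names_with_min_total = int((totals == totals.min()).sum())

print(names_in_min_rows, names_with_min_total)
```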


### Step 11. What is the median name occurrence?

In [113]:
baby.median(numeric_only=True)  # skip the string columns (Name, Gender, State)


Id       2811921.0
Year        2009.0
Count         11.0
dtype: float64

In [114]:
baby.Count.median(axis=0)

11.0

### Step 12. What is the standard deviation of name occurrences?

In [124]:
baby.Count.std() # returns the sample standard deviation of Count

97.39734648617814

### Step 13. Get a summary with the mean, min, max, std and quartiles.

In [101]:
baby["Count"].describe()

count    1.016395e+06
mean     3.485012e+01
std      9.739735e+01
min      5.000000e+00
25%      7.000000e+00
50%      1.100000e+01
75%      2.600000e+01
max      4.167000e+03
Name: Count, dtype: float64

In [102]:
baby.describe()

,Id,Year,Count
count,1016395.0,1016395.0,1016395.0
mean,2830991.0,2009.053,34.85012
std,1652476.0,3.138293,97.39735
min,11350.0,2004.0,5.0
25%,1317328.0,2006.0,7.0
50%,2811921.0,2009.0,11.0
75%,4242556.0,2012.0,26.0
max,5647426.0,2014.0,4167.0
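
When only some of these statistics are needed, they can be requested individually instead of through `describe()`. A minimal sketch on a hypothetical four-value series (the values are made up for illustration), using `agg` for the named statistics and `quantile` for the quartiles:

```python
import pandas as pd

# Hypothetical counts, not taken from the dataset.
counts = pd.Series([5, 7, 11, 26], name="Count")

# agg with a list of function names returns a Series indexed by those names.
summary = counts.agg(["mean", "std", "min", "max"])

# quantile with a list returns a Series indexed by the requested fractions
# (linear interpolation between neighbouring values is the default).
quartiles = counts.quantile([0.25, 0.5, 0.75])

print(summary)
print(quartiles)
```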
