# US - Baby Names

### Introduction:

We are going to use a subset of [US Baby Names](https://www.kaggle.com/kaggle/us-baby-names) from Kaggle.  
The file contains names from 2004 through 2014.


### Step 1. Import the necessary libraries

In [None]:
import pandas as pd
import numpy as np

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv).

### Step 3. Assign it to a variable called baby_names.

In [None]:
baby_names = pd.read_csv("https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv")

In [None]:
baby_names.head()

,Unnamed: 0,Id,Name,Year,Gender,State,Count
0,11349,11350,Emma,2004,F,AK,62
1,11350,11351,Madison,2004,F,AK,48
2,11351,11352,Hannah,2004,F,AK,46
3,11352,11353,Grace,2004,F,AK,44
4,11353,11354,Emily,2004,F,AK,41
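
A column named `Unnamed: 0` usually appears when a CSV was saved together with its index. The sketch below illustrates this on a tiny in-memory CSV; `csv_text` is a hypothetical stand-in for the real file (its two rows are copied from the output above), and it assumes the file's first header cell is empty.

```python
import io
import pandas as pd

# Hypothetical miniature of the file: the leading unnamed column is a saved index.
csv_text = """,Id,Name,Year,Gender,State,Count
11349,11350,Emma,2004,F,AK,62
11350,11351,Madison,2004,F,AK,48
"""

# Default read: the unnamed leading column becomes a regular column "Unnamed: 0".
default_read = pd.read_csv(io.StringIO(csv_text))

# With index_col=0 the same column is used as the row index instead.
indexed_read = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_read.columns.tolist())
print(indexed_read.columns.tolist())
```

Reading with `index_col=0` is one way to avoid the extra column; dropping it afterwards, as in Step 5, works just as well.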


### Step 4. See the first 10 entries

In [None]:
baby_names.head(10)

,Unnamed: 0,Id,Name,Year,Gender,State,Count
0,11349,11350,Emma,2004,F,AK,62
1,11350,11351,Madison,2004,F,AK,48
2,11351,11352,Hannah,2004,F,AK,46
3,11352,11353,Grace,2004,F,AK,44
4,11353,11354,Emily,2004,F,AK,41
5,11354,11355,Abigail,2004,F,AK,37
6,11355,11356,Olivia,2004,F,AK,33
7,11356,11357,Isabella,2004,F,AK,30
8,11357,11358,Alyssa,2004,F,AK,29
9,11358,11359,Sophia,2004,F,AK,28


### Step 5. Delete the columns 'Unnamed: 0' and 'Id'

In [None]:
baby_names = baby_names.drop(columns=['Unnamed: 0', 'Id'])

In [None]:
baby_names.head()

,Name,Year,Gender,State,Count
0,Emma,2004,F,AK,62
1,Madison,2004,F,AK,48
2,Hannah,2004,F,AK,46
3,Grace,2004,F,AK,44
4,Emily,2004,F,AK,41
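
As a small illustration of how `drop` behaves, the sketch below uses a toy frame (`toy` and `trimmed` are hypothetical names; the values are copied from the earlier output). It shows that `drop` returns a new DataFrame, which is why the result has to be assigned back.

```python
import pandas as pd

# Toy frame with the same column names as the dataset (values copied from the output above).
toy = pd.DataFrame({
    "Unnamed: 0": [11349, 11350],
    "Id": [11350, 11351],
    "Name": ["Emma", "Madison"],
    "Count": [62, 48],
})

# drop returns a new DataFrame; the original keeps all of its columns.
trimmed = toy.drop(columns=["Unnamed: 0", "Id"])

print(trimmed.columns.tolist())
print(toy.columns.tolist())
```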


### Step 6. What year has the highest number of baby names in the dataset?

In [None]:
# Group by Year and count the number of records (rows) per year
year_counts = baby_names.groupby('Year').size()

In [None]:
year_counts.idxmax()

np.int64(2008)
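
Note that `size()` counts records, not babies. The sketch below, on hypothetical toy data, contrasts the two readings of "highest number of baby names": rows per year versus the sum of `Count` per year. Which one the exercise intends is an assumption either way.

```python
import pandas as pd

# Hypothetical records: each row is one (name, year) entry with a baby count.
toy = pd.DataFrame({
    "Name": ["Emma", "Emma", "Grace", "Emma"],
    "Year": [2004, 2004, 2004, 2005],
    "Count": [62, 10, 44, 200],
})

rows_per_year = toy.groupby("Year").size()            # number of records per year
babies_per_year = toy.groupby("Year")["Count"].sum()  # number of babies per year

print(rows_per_year.idxmax())
print(babies_per_year.idxmax())
```

In this toy example the two measures disagree, so it is worth stating which one an answer is based on.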

### Step 7. Are there more male or female names in the dataset?

In [None]:
# Count the rows per gender, then pick the gender with the larger count
gender_counts = baby_names.groupby('Gender').size()
gender_counts.idxmax()

'F'

In [None]:
# Alternative: value_counts() returns the row count per gender directly
baby_names['Gender'].value_counts()

Gender,count
F,558846
M,457549
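
To express the counts above as shares, they can be divided by their total. The sketch below rebuilds the two counts by hand (`gender_rows` is a hypothetical name); on the real column, `value_counts(normalize=True)` should give the same proportions.

```python
import pandas as pd

# Row counts per gender, copied from the value_counts output above.
gender_rows = pd.Series({"F": 558846, "M": 457549}, name="count")

# Dividing by the total turns the counts into shares that sum to 1.
gender_share = gender_rows / gender_rows.sum()

print(gender_share.round(3))
```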


### Step 8. Group the dataset by name and assign to names

In [None]:
baby_names.head()

,Name,Gender,State,Count
0,Emma,F,AK,62
1,Madison,F,AK,48
2,Hannah,F,AK,46
3,Grace,F,AK,44
4,Emily,F,AK,41


In [None]:
# Group the rows by name; head() on the grouped object returns the first 5 rows of every group
names = baby_names.groupby("Name")
names.head()

,Name,Gender,State,Count
0,Emma,F,AK,62
1,Madison,F,AK,48
2,Hannah,F,AK,46
3,Grace,F,AK,44
4,Emily,F,AK,41
...,...,...,...,...
1004923,Gryffin,M,WI,5
1004950,Kroy,M,WI,5
1004973,Owyn,M,WI,5
1005707,Haylea,F,WV,5
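
A grouped object does not aggregate anything by itself. The sketch below, on hypothetical toy data, shows the difference between `head()` on a `groupby` result (first rows of each group) and an actual aggregation such as `sum()`.

```python
import pandas as pd

# Hypothetical toy records with two distinct names.
toy = pd.DataFrame({
    "Name": ["Emma", "Grace", "Emma", "Emma", "Grace"],
    "State": ["AK", "AK", "AL", "AZ", "AL"],
    "Count": [62, 44, 150, 90, 70],
})

grouped = toy.groupby("Name")     # lazy: nothing is aggregated yet

first_two = grouped.head(2)       # first 2 rows of every group, original row order kept
totals = grouped["Count"].sum()   # one aggregated value per name

print(grouped.ngroups)
print(first_two)
print(totals)
```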


### Step 9. How many different names exist in the dataset?

In [None]:
baby_names['Name'].nunique()

17632

### Step 10. What is the name with most occurrences?

In [None]:
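# Count the rows per name; value_counts() already sorts in descending order, so the first entry is the most frequent name
baby_names['Name'].value_counts().sort_values(ascending=False)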

Name,count
Riley,1112
Avery,1080
Jordan,1073
Peyton,1064
Hayden,1049
...,...
Magdala,1
Sherlynn,1
Nephtalie,1
Catriona,1
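
The ranking above counts records per name. A minimal sketch on hypothetical data (the counts, and the name `Jacob`, are made up for illustration) shows how `idxmax()` returns a single answer, and how the answer can change if occurrences are measured as the sum of `Count` instead.

```python
import pandas as pd

# Hypothetical data: "Riley" has more records, "Jacob" has more babies.
toy = pd.DataFrame({
    "Name": ["Riley", "Riley", "Riley", "Jacob", "Jacob"],
    "Count": [5, 6, 7, 300, 250],
})

row_counts = toy["Name"].value_counts()            # records per name
baby_totals = toy.groupby("Name")["Count"].sum()   # babies per name

print(row_counts.idxmax())
print(baby_totals.idxmax())
```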


### Step 11. How many different names have the least occurrences?

In [None]:
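# Row count per name, then keep only the names that share the smallest count
name_rows = baby_names['Name'].value_counts()
min_count = name_rows[name_rows == name_rows.min()]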

In [None]:
min_count.count()

np.int64(3682)
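
The same idea can be sketched on a toy Series without hard-coding the smallest count. The names below are taken from earlier outputs, but their frequencies here are hypothetical.

```python
import pandas as pd

# Hypothetical name column: two names appear twice, two appear once.
names_col = pd.Series(["Emma", "Emma", "Kroy", "Owyn", "Grace", "Grace"], name="Name")

row_counts = names_col.value_counts()

# Keep every name whose count equals the minimum, whatever that minimum is.
rarest = row_counts[row_counts == row_counts.min()]

print(len(rarest))
print(sorted(rarest.index))
```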

### Step 12. What is the median name occurrence?

In [None]:
name_total_counts = baby_names.groupby('Name')['Count'].sum()
median_occurrence = name_total_counts.median()
print(f"The median name occurrence is: {median_occurrence}")

The median name occurrence is: 49.0
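
To see why the median is a useful summary for this kind of data, the sketch below compares it with the mean on a few illustrative totals (hypothetical values, with one very popular name among several rare ones).

```python
import pandas as pd

# Illustrative totals only: one very popular name among several rare ones.
totals = pd.Series([12, 23, 5, 3426, 6], name="Count")

print(totals.median())  # middle value after sorting
print(totals.mean())    # pulled upward by the single large total
```

The median is barely affected by the one large value, while the mean is dominated by it.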


### Step 13. What is the standard deviation of name occurrences?

In [None]:
std_occurrence = name_total_counts.std()
print(f"The standard deviation of name occurrences is: {std_occurrence}")

The standard deviation of name occurrences is: 11006.069467891111
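
One detail worth knowing: pandas computes the sample standard deviation (`ddof=1`) by default, whereas NumPy's `np.std` defaults to the population version (`ddof=0`). The sketch below checks this on a few illustrative totals (hypothetical values).

```python
import numpy as np
import pandas as pd

# Illustrative totals only.
totals = pd.Series([12, 23, 5, 3426, 6], name="Count")

sample_std = totals.std()            # pandas default: ddof=1 (divide by n - 1)
population_std = totals.std(ddof=0)  # divide by n, matching NumPy's default

print(sample_std)
print(population_std)
print(np.std(totals.to_numpy()))
```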


### Step 14. Get a summary with the mean, min, max, std and quartiles.

In [None]:
name_total_counts.describe()

,Count
count,17632.0
mean,2008.932169
std,11006.069468
min,5.0
25%,11.0
50%,49.0
75%,337.0
max,242874.0
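
If other cut points than the quartiles are of interest, `describe` accepts a `percentiles` argument. A minimal sketch on illustrative totals (hypothetical values):

```python
import pandas as pd

# Illustrative totals only.
totals = pd.Series([12, 23, 5, 3426, 6], name="Count")

# Request the 10th and 90th percentiles; the median (50%) is included as well.
summary = totals.describe(percentiles=[0.1, 0.9])

print(summary)
```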


# Task
Calculate and display the median of the total occurrences for each baby name in the `baby_names` DataFrame.

## Calculate total occurrences for each name

### Subtask:
Group the `baby_names` DataFrame by 'Name' and sum the 'Count' to get the total occurrences for each unique name.


**Reasoning**:
To calculate the total occurrences for each name as requested in the subtask, I will group the `baby_names` DataFrame by the 'Name' column and then sum the 'Count' column for each group. The result will be stored in a new variable called `name_total_counts`.



In [None]:
name_total_counts = baby_names.groupby('Name')['Count'].sum()
print(name_total_counts.head())

Name
Aaban        12
Aadan        23
Aadarsh       5
Aaden      3426
Aadhav        6
Name: Count, dtype: int64


**Reasoning**:
The subtask is to calculate and display the median of the total occurrences for each baby name. Since `name_total_counts` now holds the total occurrences for each name, the next step is to calculate its median and display it.



In [None]:
median_total_occurrences = name_total_counts.median()
print(f"The median of total occurrences for each baby name is: {median_total_occurrences}")

The median of total occurrences for each baby name is: 49.0


## Summary:

### Q&A
The median of the total occurrences for each baby name is 49.0.

### Data Analysis Key Findings
*   The total occurrences for each unique baby name were calculated by summing the 'Count' for each 'Name'. For instance, 'Aaban' had 12 total occurrences, 'Aadan' had 23, 'Aadarsh' had 5, 'Aaden' had 3426, and 'Aadhav' had 6.
*   The median of these total occurrences across all baby names was determined to be 49.0.

### Insights or Next Steps
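*   The median of 49.0 means that at least half of the unique baby names were recorded 49 times or fewer in total. Compared with the mean of about 2,009 and the maximum of 242,874 reported in Step 14, this points to a strongly right-skewed distribution: a few very popular names and a long tail of rare ones.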
*   Further analysis could involve examining the distribution of total occurrences (e.g., using quartiles, histograms, or density plots) to understand the full range of name popularity and identify extremely popular or rare names.
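
One possible way to start on the distribution analysis suggested above is to bin the totals on a logarithmic scale, since they span several orders of magnitude. The sketch below uses a handful of illustrative values (an assumption, not the full data) and `np.histogram` with one bin per power of ten.

```python
import numpy as np

# Illustrative totals spanning several orders of magnitude.
values = np.array([5, 6, 12, 23, 49, 337, 3426, 242874])

# One bin per power of ten: [1, 10), [10, 100), ..., [100000, 1000000].
edges = np.arange(0, 7)
counts, _ = np.histogram(np.log10(values), bins=edges)

for low, n in zip(edges[:-1], counts):
    print(f"10^{low} to 10^{low + 1}: {n} names")
```

On the real data, `values` would be replaced by `name_total_counts.to_numpy()`.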
