# US - Baby Names

### Introduction:

We are going to use a subset of [US Baby Names](https://www.kaggle.com/kaggle/us-baby-names) from Kaggle.  
The file contains names from 2004 through 2014.


### Step 1. Import the necessary libraries

In [1]:
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("baby").getOrCreate()
spark

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv). 

### Step 3. Assign it to a variable called baby_names.

In [2]:
from pyspark import SparkFiles

In [3]:
url = "https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv"

spark.sparkContext.addFile(url)

baby_names = spark.read.csv(SparkFiles.get("US_Baby_Names_right.csv"), header=True, inferSchema=True, sep=',')
baby_names.show(5)

+-----+-----+-------+----+------+-----+-----+
|  _c0|   Id|   Name|Year|Gender|State|Count|
+-----+-----+-------+----+------+-----+-----+
|11349|11350|   Emma|2004|     F|   AK|   62|
|11350|11351|Madison|2004|     F|   AK|   48|
|11351|11352| Hannah|2004|     F|   AK|   46|
|11352|11353|  Grace|2004|     F|   AK|   44|
|11353|11354|  Emily|2004|     F|   AK|   41|
+-----+-----+-------+----+------+-----+-----+
only showing top 5 rows



### Step 4. See the first 10 entries

In [4]:
baby_names.show(10)

+-----+-----+--------+----+------+-----+-----+
|  _c0|   Id|    Name|Year|Gender|State|Count|
+-----+-----+--------+----+------+-----+-----+
|11349|11350|    Emma|2004|     F|   AK|   62|
|11350|11351| Madison|2004|     F|   AK|   48|
|11351|11352|  Hannah|2004|     F|   AK|   46|
|11352|11353|   Grace|2004|     F|   AK|   44|
|11353|11354|   Emily|2004|     F|   AK|   41|
|11354|11355| Abigail|2004|     F|   AK|   37|
|11355|11356|  Olivia|2004|     F|   AK|   33|
|11356|11357|Isabella|2004|     F|   AK|   30|
|11357|11358|  Alyssa|2004|     F|   AK|   29|
|11358|11359|  Sophia|2004|     F|   AK|   28|
+-----+-----+--------+----+------+-----+-----+
only showing top 10 rows



In [5]:
print(baby_names.count())

1016395


### Step 5. Delete the columns '_c0' (called 'Unnamed: 0' in pandas) and 'Id'

In [7]:
cols_to_drop = ["_c0","Id"]
baby_names= baby_names.drop(*cols_to_drop)
baby_names.show(2)

+-------+----+------+-----+-----+
|   Name|Year|Gender|State|Count|
+-------+----+------+-----+-----+
|   Emma|2004|     F|   AK|   62|
|Madison|2004|     F|   AK|   48|
+-------+----+------+-----+-----+
only showing top 2 rows



### Step 6. Are there more male or female names in the dataset?

In [12]:
females = baby_names.filter(baby_names.Gender.startswith("F")).count()
males = baby_names.filter(baby_names.Gender.startswith("M")).count()

print("females:",females, "males:",males)
if females > males:
    print("More Females",females)
else:
    print("More Males",males)

females: 558846 males: 457549
More Females 558846


### Step 7. Group the dataset by name and assign to names

In [31]:
names = baby_names.groupBy("Name").count()
names = names.withColumnRenamed("count","name_count")
names.show(5)

+------+----------+
|  Name|name_count|
+------+----------+
| Kiana|       341|
|Alayna|       469|
| Ember|       262|
| Tyler|       770|
|Maddox|       537|
+------+----------+
only showing top 5 rows



### Step 8. How many different names exist in the dataset?

In [19]:
names.count()

17632

### Step 9. What is the name with most occurrences?

In [32]:
names.orderBy("name_count", ascending=False).show(5)

+------+----------+
|  Name|name_count|
+------+----------+
| Riley|      1112|
| Avery|      1080|
|Jordan|      1073|
|Peyton|      1064|
|Hayden|      1049|
+------+----------+
only showing top 5 rows



In [33]:
names.orderBy("name_count", ascending=False).head(1)[0][0]

'Riley'

In [34]:
names.filter(names.Name.startswith("Jacob")).show()

+------------+----------+
|        Name|name_count|
+------------+----------+
|      Jacoby|       330|
|       Jacob|       568|
|      Jacobe|        19|
|      Jacobi|       118|
|Jacobanthony|         2|
|      Jacobo|        44|
|     Jacobie|         9|
|     Jacobey|         1|
+------------+----------+



### Step 10. How many different names have the least occurrences?

In [25]:
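# Import the functions module under an alias; a wildcard import would
# shadow Python built-ins such as min, max and sum.
from pyspark.sql import functions as F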

In [35]:
min_count = names.agg({"name_count":"min"}).head(1)[0][0]

In [36]:
min_count

1

In [38]:
# Number of distinct names whose occurrence count equals the minimum
rare_names = names.filter(names.name_count == min_count).count()
rare_names

3682

### Step 11. What is the median name occurrence?

In [42]:
# approxQuantile(column, probabilities, relativeError):
# a small relativeError gives a result close to the exact median
median_count = names.approxQuantile("name_count", [0.5], 0.0001)
median_count

[8.0]

### Step 12. What is the standard deviation of name occurrences?

In [44]:
names.describe(['name_count']).show()

+-------+------------------+
|summary|        name_count|
+-------+------------------+
|  count|             17632|
|   mean|57.644906987295826|
| stddev| 122.0299635081389|
|    min|                 1|
|    max|              1112|
+-------+------------------+



### Step 13. Get a summary with the mean, min, max, std and quartiles.

In [46]:
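# Mean, min, max and stddev are already shown by describe() in Step 12;
# this cell adds the quartiles. relativeError=0.1 is a coarse tolerance,
# which is why the median here (6.0) differs from the 8.0 found in Step 11.
names.approxQuantile("name_count", [0.25, 0.5, 0.75], 0.1)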

[2.0, 6.0, 22.0]