# US - Baby Names

### Introduction:

We are going to use a subset of [US Baby Names](https://www.kaggle.com/kaggle/us-baby-names) from Kaggle.  
The file contains names from 2004 through 2014.


### Step 1. Import the necessary libraries

In [9]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as func

In [2]:
spark = SparkSession.builder.appName("baby-names").getOrCreate()


### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv). 

### Step 3. Assign it to a variable called baby_names.

In [3]:
baby_names = spark.read.option("header", True).option("inferSchema", True).csv("US_Baby_Names_right.csv")

                                                                                

### Step 4. See the first 10 entries

In [6]:
baby_names.show(n=10)

23/01/26 21:28:58 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: , Id, Name, Year, Gender, State, Count
 Schema: _c0, Id, Name, Year, Gender, State, Count
Expected: _c0 but found: 
CSV file: file:///home/avillalbacantero/Documents/Career/Self-Training/pyspark_exercises/06_Stats/US_Baby_Names/US_Baby_Names_right.csv
+-----+-----+--------+----+------+-----+-----+
|  _c0|   Id|    Name|Year|Gender|State|Count|
+-----+-----+--------+----+------+-----+-----+
|11349|11350|    Emma|2004|     F|   AK|   62|
|11350|11351| Madison|2004|     F|   AK|   48|
|11351|11352|  Hannah|2004|     F|   AK|   46|
|11352|11353|   Grace|2004|     F|   AK|   44|
|11353|11354|   Emily|2004|     F|   AK|   41|
|11354|11355| Abigail|2004|     F|   AK|   37|
|11355|11356|  Olivia|2004|     F|   AK|   33|
|11356|11357|Isabella|2004|     F|   AK|   30|
|11357|11358|  Alyssa|2004|     F|   AK|   29|
|11358|11359|  Sophia|2004|     F|   AK|   28|
+-----+-----+--------+----+------+-----+-----+

### Step 5. Delete the columns '_c0' (the unnamed index column) and 'Id'

In [7]:
baby_names = baby_names.drop("_c0", "Id")

In [8]:
baby_names.printSchema()

root
 |-- Name: string (nullable = true)
 |-- Year: integer (nullable = true)
 |-- Gender: string (nullable = true)
 |-- State: string (nullable = true)
 |-- Count: integer (nullable = true)



### Step 6. Are there more male or female names in the dataset?

In [10]:
baby_names.groupBy("Gender").agg(func.count("Name")).show()



+------+-----------+
|Gender|count(Name)|
+------+-----------+
|     F|     558846|
|     M|     457549|
+------+-----------+
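Note that `func.count("Name")` counts rows, and each row is one (name, year, gender, state) record, so the figures above are record counts rather than births or distinct names. The sketch below uses plain Python on a few made-up records (not taken from the dataset) to show how the three readings can differ; the helper name `gender_summaries` is hypothetical.

```python
from collections import defaultdict

# Made-up records in the dataset's shape: (Name, Year, Gender, State, Count).
records = [
    ("Emma", 2004, "F", "AK", 62),
    ("Emma", 2005, "F", "AK", 50),
    ("Grace", 2004, "F", "AK", 44),
    ("Jacob", 2004, "M", "AK", 200),
]


def gender_summaries(rows):
    """Return per-gender row counts, summed Count values, and distinct-name counts."""
    row_counts = defaultdict(int)
    births = defaultdict(int)
    names = defaultdict(set)
    for name, _year, gender, _state, count in rows:
        row_counts[gender] += 1      # one more record for this gender
        births[gender] += count      # add the babies in this record
        names[gender].add(name)      # remember the distinct name
    distinct = {g: len(s) for g, s in names.items()}
    return dict(row_counts), dict(births), distinct


row_counts, births, distinct = gender_summaries(records)
print(row_counts)  # records per gender
print(births)      # summed Count per gender
print(distinct)    # distinct names per gender
```

In PySpark, the corresponding aggregations would be `func.count("Name")`, `func.sum("Count")` and `func.countDistinct("Name")` inside the same `groupBy("Gender").agg(...)` call.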



                                                                                

### Step 7. Group the dataset by name and assign to names

In [24]:
names = baby_names.groupBy("Name").agg(func.sum("Count").alias("count_name")).sort("count_name", ascending=False)
names.show()

+-----------+----------+
|       Name|count_name|
+-----------+----------+
|      Jacob|    242874|
|       Emma|    214852|
|    Michael|    214405|
|      Ethan|    209277|
|   Isabella|    204798|
|    William|    197894|
|     Joshua|    191551|
|     Sophia|    191446|
|     Daniel|    191440|
|      Emily|    190318|
|     Olivia|    188036|
|  Alexander|    187189|
|    Matthew|    185279|
|       Noah|    179925|
|    Anthony|    179256|
|     Andrew|    174975|
|Christopher|    172997|
|     Joseph|    169972|
|      David|    167606|
|        Ava|    167369|
+-----------+----------+
only showing top 20 rows

### Step 8. How many different names exist in the dataset?

In [15]:
baby_names.select("Name").distinct().count()

17632

### Step 9. What is the name with most occurrences?

In [None]:
# Jacob, with 242874 occurrences (the top row of the totals sorted in descending order)

### Step 10. How many different names have the least occurrences?

In [32]:
counts = baby_names.groupBy("Name").agg(func.sum("Count").alias("count_name")).sort("count_name", ascending=True)
counts.show()

+------------+----------+
|        Name|count_name|
+------------+----------+
|        Siah|         5|
|       Linsy|         5|
|     Yanette|         5|
|      Breezy|         5|
|       Tejon|         5|
|        Roxi|         5|
|    Chantell|         5|
|      Mantej|         5|
|   Prabhleen|         5|
|     Clariza|         5|
|   Bethlehem|         5|
|     Manisha|         5|
|     Zareena|         5|
|       Siris|         5|
|Angeldejesus|         5|
|   Jezebelle|         5|
|       Kavir|         5|
|       Shian|         5|
|     Aaminah|         5|
|   Francella|         5|
+------------+----------+
only showing top 20 rows



                                                                                

In [35]:
min_count = counts.agg(func.min("count_name"))
min_count.show()



+---------------+
|min(count_name)|
+---------------+
|              5|
+---------------+



                                                                                

In [37]:
counts.filter(counts.count_name == 5).count()

                                                                                

2578

### Step 11. What is the median name occurrence?

In [28]:
baby_names.groupBy("Name").agg(func.percentile_approx("Count", 0.5).alias("median")).sort("median", ascending=False).show()

+---------+------+
|     Name|median|
+---------+------+
|     Emma|   278|
|  William|   263|
|    Jacob|   252|
|    Ethan|   245|
|   Olivia|   228|
|    James|   209|
|  Michael|   204|
| Isabella|   203|
|   Joshua|   196|
|    Emily|   195|
|      Ava|   194|
|  Madison|   189|
|   Sophia|   188|
|  Abigail|   188|
|     Noah|   185|
|   Andrew|   182|
|  Jackson|   180|
|   Joseph|   175|
|Alexander|   174|
|   Samuel|   174|
+---------+------+
only showing top 20 rows



                                                                                

In [38]:
counts.agg(func.percentile_approx("count_name", 0.5)).show()

                                                                                

+-----------------------------------------+
|percentile_approx(count_name, 0.5, 10000)|
+-----------------------------------------+
|                                       49|
+-----------------------------------------+
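The two cells above answer different questions: the first takes the median of the per-record `Count` values within each name, while the second takes the median of the per-name totals. `percentile_approx` is documented as returning a value from the ordered column rather than an interpolated one. As a rough illustration on made-up numbers, Python's `statistics.median_low` behaves similarly for an even number of values, whereas `statistics.median` averages the two middle values. Treating `median_low` as a stand-in for Spark's behaviour is an assumption of this sketch.

```python
import statistics

# Made-up per-name totals, not taken from the dataset.
totals = [5, 5, 49, 60]

interpolated = statistics.median(totals)   # mean of the two middle values
lower = statistics.median_low(totals)      # the lower of the two middle values

print(interpolated, lower)
```

When the number of values is odd, both functions return the same middle element, so the distinction only matters for even-sized groups.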



### Step 12. What is the standard deviation of names?

In [27]:
baby_names.groupBy("Name").agg(func.stddev("Count")).show()



+--------+------------------+
|    Name|stddev_samp(Count)|
+--------+------------------+
|   Kiana|23.162293881598007|
|  Alayna| 25.85988087814688|
|   Ember| 9.404852177442312|
|   Tyler| 227.7471939892703|
|  Maddox| 38.41906211295464|
|  Kellen|12.267182240040928|
|  Heaven| 29.25723231112382|
|Julianne|12.078546840341312|
| Susanna|3.9785406711462277|
|  Kenlee| 3.051028011969828|
|    Kloe|  5.26400534972956|
|   Anyah|2.0939311949035817|
|   Tegan| 4.688324787315009|
| Jazzlyn|11.468219817804055|
|Brileigh|1.7917941611104422|
|Analeigh| 5.801596444364098|
|Kamarion|3.8065643188655995|
|   Aryan|14.245126678533055|
| Galilea|  30.1926799162337|
|    Faye| 6.371150253726211|
+--------+------------------+
only showing top 20 rows



                                                                                

In [40]:
counts.agg(func.stddev("count_name")).show()

                                                                                

+-----------------------+
|stddev_samp(count_name)|
+-----------------------+
|      11006.06946789057|
+-----------------------+
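As the output column name `stddev_samp` indicates, `func.stddev` computes the sample standard deviation (dividing by n - 1); `func.stddev_pop` is the population version. The sketch below uses the standard library on made-up values to show the difference, and why the sample version is undefined for a name that appears in only one row.

```python
import statistics

# Made-up Count values for one name, not taken from the dataset.
counts_for_name = [10, 12, 14]

sample_sd = statistics.stdev(counts_for_name)       # divides by n - 1
population_sd = statistics.pstdev(counts_for_name)  # divides by n

print(sample_sd, population_sd)

# With a single observation the sample standard deviation is undefined,
# and the standard library signals this with StatisticsError.
try:
    statistics.stdev([7])
    single_row_defined = True
except statistics.StatisticsError:
    single_row_defined = False

print(single_row_defined)
```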



### Step 13. Get a summary with the mean, min, max, std and quartiles.

In [42]:
baby_names.describe().show()



+-------+--------+------------------+-------+-------+-----------------+
|summary|    Name|              Year| Gender|  State|            Count|
+-------+--------+------------------+-------+-------+-----------------+
|  count| 1016395|           1016395|1016395|1016395|          1016395|
|   mean|Infinity|2009.0531899507573|   null|   null|34.85012421351935|
| stddev|    null|3.1382928281811524|   null|   null|97.39734648617832|
|    min|   Aaban|              2004|      F|     AK|                5|
|    max|  Zyriah|              2014|      M|     WY|             4167|
+-------+--------+------------------+-------+-------+-----------------+
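`describe()` reports count, mean, stddev, min and max, but no quartiles. PySpark's `DataFrame.summary()` also includes approximate 25%, 50% and 75% rows by default, so `baby_names.summary().show()` is probably closer to what this step asks for; it was not run here, so no output is shown. As a small standard-library illustration of what the quartiles are, on made-up values:

```python
import statistics

# Made-up Count values, not taken from the dataset.
values = [5, 7, 9, 12, 20, 33, 62]

# n=4 returns the three cut points that split the data into four groups.
# method="inclusive" treats the data as the whole population.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")

print(q1, q2, q3)
```

The middle cut point is the median, so `q2` matches `statistics.median(values)` for this data.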



                                                                                