# Regiment

### Introduction:

Special thanks to http://chrisalbon.com/ for sharing the dataset and materials.

### Step 1. Import the necessary libraries

In [1]:
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("regiment").getOrCreate()
spark

### Step 2. Create the DataFrame with the following values:

In [3]:
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'],
            'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd'],
            'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'],
            'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
            'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}

### Step 3. Assign it to a variable called regiment.
#### Don't forget to name each column

In [2]:
import pandas as pd

In [5]:
regiment = spark.createDataFrame(pd.DataFrame(data=raw_data))
regiment.show(5)

+----------+-------+--------+------------+-------------+
|  regiment|company|    name|preTestScore|postTestScore|
+----------+-------+--------+------------+-------------+
|Nighthawks|    1st|  Miller|           4|           25|
|Nighthawks|    1st|Jacobson|          24|           94|
|Nighthawks|    2nd|     Ali|          31|           57|
|Nighthawks|    2nd|  Milner|           2|           62|
|  Dragoons|    1st|   Cooze|           3|           70|
+----------+-------+--------+------------+-------------+
only showing top 5 rows



### Step 4. What is the mean preTestScore for the Nighthawks regiment?

In [8]:
regiment.filter(regiment.regiment == "Nighthawks").agg({"preTestScore":"avg"}).show()

+-----------------+
|avg(preTestScore)|
+-----------------+
|            15.25|
+-----------------+
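
As a quick sanity check on the 15.25 above, the same mean can be recomputed without Spark. This is only a sketch: `nighthawks_pre` is a hypothetical name, and its values are copied by hand from the four Nighthawks rows of `raw_data`.

```python
from statistics import mean

# preTestScore values of the four Nighthawks rows (Miller, Jacobson, Ali, Milner),
# copied by hand from raw_data; the variable name is illustrative only.
nighthawks_pre = [4, 24, 31, 2]

nighthawks_mean = mean(nighthawks_pre)  # (4 + 24 + 31 + 2) / 4
print(nighthawks_mean)  # 15.25
```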



### Step 5. Present general statistics by company

In [13]:
import pyspark.sql.functions as F

In [21]:
regiment.groupby("company").agg(
    F.count('preTestScore').alias('count'),
    F.mean('preTestScore').alias('mean'),
    F.stddev('preTestScore').alias('std'),
    F.min('preTestScore').alias('min'),
    F.expr('percentile(preTestScore, array(0.25))')[0].alias('%25'),
    F.expr('percentile(preTestScore, array(0.5))')[0].alias('%50'),
    F.expr('percentile(preTestScore, array(0.75))')[0].alias('%75'),
    F.max('preTestScore').alias('max'),
).show()

+-------+-----+-----------------+------------------+---+----+----+-----+---+
|company|count|             mean|               std|min| %25| %50|  %75|max|
+-------+-----+-----------------+------------------+---+----+----+-----+---+
|    2nd|    6|             15.5|14.652644812456213|  2|2.25|13.5|29.25| 31|
|    1st|    6|6.666666666666667| 8.524474568362947|  2| 3.0| 3.5|  4.0| 24|
+-------+-----+-----------------+------------------+---+----+----+-----+---+
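
The quartiles in this table are consistent with linear interpolation between the sorted values, and the `std` column with the sample (n - 1) standard deviation. The sketch below re-derives the `2nd` company row in plain Python. `linear_percentile` is a hypothetical helper written for this check, not a Spark function, and the scores are copied by hand from `raw_data`.

```python
import math
from statistics import stdev

def linear_percentile(values, p):
    """Percentile p (between 0 and 1) by linear interpolation over sorted values."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * p   # fractional index into the sorted list
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

# preTestScore values for company "2nd", copied by hand from raw_data.
second_company = [31, 2, 24, 31, 2, 3]

quartiles = [linear_percentile(second_company, p) for p in (0.25, 0.5, 0.75)]
sample_std = stdev(second_company)  # sample standard deviation (divides by n - 1)

print(quartiles)   # [2.25, 13.5, 29.25]
print(sample_std)
```

For example, the 25th percentile falls at position (6 - 1) * 0.25 = 1.25 in the sorted list [2, 2, 3, 24, 31, 31], a quarter of the way from 2 to 3, which gives 2.25.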



### Step 6. What is the mean of each company's preTestScore?

In [26]:
regiment.groupBy("company").agg({"preTestScore": "mean"}).orderBy("company").show()

+-------+-----------------+
|company|avg(preTestScore)|
+-------+-----------------+
|    1st|6.666666666666667|
|    2nd|             15.5|
+-------+-----------------+



### Step 7. Present the mean preTestScores grouped by regiment and company

In [25]:
regiment.groupBy(["regiment", "company"]).agg({"preTestScore": "mean"}) \
    .orderBy(["regiment", "company"], ascending=True).show()

+----------+-------+-----------------+
|  regiment|company|avg(preTestScore)|
+----------+-------+-----------------+
|  Dragoons|    1st|              3.5|
|  Dragoons|    2nd|             27.5|
|Nighthawks|    1st|             14.0|
|Nighthawks|    2nd|             16.5|
|    Scouts|    1st|              2.5|
|    Scouts|    2nd|              2.5|
+----------+-------+-----------------+
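
To see what the grouped aggregation computes, the same means can be rebuilt with a dictionary keyed by `(regiment, company)`. This is a plain-Python sketch for checking the table above, not a description of how Spark executes the query. `rows`, `scores_by_group`, and `group_means` are hypothetical names, and the data is copied from `raw_data`.

```python
from collections import defaultdict

# (regiment, company, preTestScore) triples copied from raw_data.
rows = [
    ("Nighthawks", "1st", 4), ("Nighthawks", "1st", 24),
    ("Nighthawks", "2nd", 31), ("Nighthawks", "2nd", 2),
    ("Dragoons", "1st", 3), ("Dragoons", "1st", 4),
    ("Dragoons", "2nd", 24), ("Dragoons", "2nd", 31),
    ("Scouts", "1st", 2), ("Scouts", "1st", 3),
    ("Scouts", "2nd", 2), ("Scouts", "2nd", 3),
]

# Collect the scores of each (regiment, company) pair.
scores_by_group = defaultdict(list)
for regiment_name, company, score in rows:
    scores_by_group[(regiment_name, company)].append(score)

# Average each group, then print in sorted key order (regiment, then company).
group_means = {key: sum(scores) / len(scores)
               for key, scores in scores_by_group.items()}
for key in sorted(group_means):
    print(key, group_means[key])
```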



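### Step 8. Present the mean preTestScores grouped by regiment and company without hierarchical indexing

Spark DataFrames have no row index, so the grouped result is already flat and the query is the same as in Step 7.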

In [27]:
regiment.groupBy(["regiment", "company"]).agg({"preTestScore": "mean"}) \
    .orderBy(["regiment", "company"], ascending=True).show()

+----------+-------+-----------------+
|  regiment|company|avg(preTestScore)|
+----------+-------+-----------------+
|  Dragoons|    1st|              3.5|
|  Dragoons|    2nd|             27.5|
|Nighthawks|    1st|             14.0|
|Nighthawks|    2nd|             16.5|
|    Scouts|    1st|              2.5|
|    Scouts|    2nd|              2.5|
+----------+-------+-----------------+



### Step 9. Group the entire DataFrame by regiment and company

In [29]:
regiment.groupBy(["regiment", "company"]).mean() \
    .orderBy(["regiment", "company"], ascending=True).show()

+----------+-------+-----------------+------------------+
|  regiment|company|avg(preTestScore)|avg(postTestScore)|
+----------+-------+-----------------+------------------+
|  Dragoons|    1st|              3.5|              47.5|
|  Dragoons|    2nd|             27.5|              75.5|
|Nighthawks|    1st|             14.0|              59.5|
|Nighthawks|    2nd|             16.5|              59.5|
|    Scouts|    1st|              2.5|              66.0|
|    Scouts|    2nd|              2.5|              66.0|
+----------+-------+-----------------+------------------+



### Step 10. What is the number of observations in each regiment and company?

In [30]:
regiment.groupBy(["regiment", "company"]).count() \
    .orderBy(["regiment", "company"], ascending=True).show()

+----------+-------+-----+
|  regiment|company|count|
+----------+-------+-----+
|  Dragoons|    1st|    2|
|  Dragoons|    2nd|    2|
|Nighthawks|    1st|    2|
|Nighthawks|    2nd|    2|
|    Scouts|    1st|    2|
|    Scouts|    2nd|    2|
+----------+-------+-----+
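
The counts above can be cross-checked by tallying the `(regiment, company)` pairs directly. This is a minimal sketch using `collections.Counter`. `regiment_col`, `company_col`, and `pair_counts` are hypothetical names, and the two lists reproduce the `regiment` and `company` columns of `raw_data`.

```python
from collections import Counter

# The regiment and company columns of raw_data, written compactly.
regiment_col = ["Nighthawks"] * 4 + ["Dragoons"] * 4 + ["Scouts"] * 4
company_col = ["1st", "1st", "2nd", "2nd"] * 3

# Count how many rows share each (regiment, company) pair.
pair_counts = Counter(zip(regiment_col, company_col))
for pair in sorted(pair_counts):
    print(pair, pair_counts[pair])
```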



### Step 11. Iterate over the regiments and print each regiment's name and all of its rows

In [37]:
regiments = regiment.select("regiment").distinct().collect()

In [39]:
for r in regiments:
    print(r[0])
    regiment.filter(regiment.regiment == r[0]).show()

Nighthawks
+----------+-------+--------+------------+-------------+
|  regiment|company|    name|preTestScore|postTestScore|
+----------+-------+--------+------------+-------------+
|Nighthawks|    1st|  Miller|           4|           25|
|Nighthawks|    1st|Jacobson|          24|           94|
|Nighthawks|    2nd|     Ali|          31|           57|
|Nighthawks|    2nd|  Milner|           2|           62|
+----------+-------+--------+------------+-------------+

Dragoons
+--------+-------+------+------------+-------------+
|regiment|company|  name|preTestScore|postTestScore|
+--------+-------+------+------------+-------------+
|Dragoons|    1st| Cooze|           3|           70|
|Dragoons|    1st| Jacon|           4|           25|
|Dragoons|    2nd|Ryaner|          24|           94|
|Dragoons|    2nd|  Sone|          31|           57|
+--------+-------+------+------------+-------------+

Scouts
+--------+-------+-----+------------+-------------+
|regiment|company| name|preTestScore|po