# Intro

For this assignment, we will calculate summaries from the **NFL Scores** dataset. First, we will calculate the summaries using Python's ```map/reduce``` functionality; then we will calculate them using ```pyspark```'s *Spark SQL* and *pandas-on-spark*.

Let's start by importing the libraries that we'll need.

In [103]:
import pandas as pd
import numpy as np
from numpy import round
import functools

# Part 1

To use Python's ```map/reduce``` functionality, we first need to split the data set into separate 'chunks.' So let's read in the data as a ```pandas``` data frame and take a look at the first few rows.

In [104]:
nfl_data = pd.read_csv("https://www4.stat.ncsu.edu/~online/datasets/scoresFull.csv")
nfl_data.head()

Unnamed: 0,week,date,day,season,awayTeam,AQ1,AQ2,AQ3,AQ4,AOT,...,homeFumLost,homeNumPen,homePenYds,home3rdConv,home3rdAtt,home4thConv,home4thAtt,homeTOP,HminusAScore,homeSpread
0,1,5-Sep,Thu,2002,San Francisco 49ers,3,0,7,6,-1,...,0,10,80,4,8,0,1,32.47,-3,-4.0
1,1,8-Sep,Sun,2002,Minnesota Vikings,3,17,0,3,-1,...,1,4,33,2,6,0,0,28.48,4,4.5
2,1,8-Sep,Sun,2002,New Orleans Saints,6,7,7,0,6,...,0,8,85,1,6,0,1,31.48,-6,6.0
3,1,8-Sep,Sun,2002,New York Jets,0,17,3,11,6,...,1,10,82,4,8,2,2,39.13,-6,-3.0
4,1,8-Sep,Sun,2002,Arizona Cardinals,10,3,3,7,-1,...,0,7,56,6,10,1,2,34.4,8,6.0


For this exercise, let's split the data set into 'chunks' according to ```season```, select **HQ4** (Home Team 4th Quarter Points) as the variable to summarize, and group the results according to the **date**. Since the date contains both the month and day of the game, we'll make the grouping a little easier by dropping the day so that we are grouping on **month** only.

The following ```for loop``` removes the day from the date column. We confirm the loop ran correctly by identifying the unique values in the date column.

In [105]:
for index, row in nfl_data.iterrows():
  month = row['date'][-3:]
  nfl_data.at[index, 'date'] = month

np.unique(np.array(nfl_data['date']))

array(['Dec', 'Feb', 'Jan', 'Nov', 'Oct', 'Sep'], dtype=object)
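The same month extraction can also be written without an explicit loop. The following is only a minimal sketch on a made-up four-row frame (the name `toy` and its values are illustrative, not taken from the NFL data); it uses the pandas `.str` accessor to keep the last three characters of each date string.

```python
import pandas as pd

# Hypothetical toy data in the same "day-Mon" format as the date column.
toy = pd.DataFrame({"date": ["5-Sep", "8-Sep", "13-Oct", "1-Feb"]})

# Slice the last three characters of every string in one vectorized step.
toy["date"] = toy["date"].str[-3:]

# Same check as in the text: list the distinct months that remain.
months = sorted(toy["date"].unique())
print(months)
```

On a data set of this size either approach works; the vectorized form mainly saves typing and tends to be faster on larger frames.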

Let's use the same ```np.unique()``` code to identify all the different seasons.

In [106]:
np.unique(np.array(nfl_data['season']))

array([2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012,
       2013, 2014])

Now we'll split the data into separate *csv* files using a ```for loop```.

In [107]:
for i in range(2002, 2015):
    nfl_data.loc[nfl_data["season"] == i].to_csv('nfl'+ str(i) +'.csv', index=False)

Lastly, we'll read the *csv* files back into a list, so that our end result will be a list containing the thirteen separate data sets.

In [108]:
nfl_sets = []
for i in range(2002, 2015):
    year = pd.read_csv('nfl'+str(i)+'.csv')
    nfl_sets.append(year)

We'll use the ```len()``` function to confirm that our list contains all thirteen data sets.

(We can also run ```nfl_sets[0]``` to check that the first element of the list is our first data set.)

In [109]:
len(nfl_sets)

13
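If the intermediate *csv* files are not needed, the split can also be done in memory. This is a hedged sketch on invented data (the `toy` frame and the `chunks` name are assumptions for illustration); it relies on `groupby` yielding one sub-frame per distinct season.

```python
import pandas as pd

# Hypothetical toy data with a season column, standing in for nfl_data.
toy = pd.DataFrame({
    "season": [2002, 2002, 2003, 2004, 2004],
    "HQ4": [3, 7, 0, 10, 6],
})

# groupby yields (key, sub-frame) pairs, sorted by key by default.
chunks = [chunk.reset_index(drop=True) for _, chunk in toy.groupby("season")]

print(len(chunks))  # one chunk per distinct season
```

Writing to disk, as done above, mimics data that really does arrive in separate files; the in-memory version is just a shortcut for experimenting.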

# Part 2

For the **HQ4** variable, we want three measures for each value of the grouping variable (**date**): the sum, the sum of squares, and the count. To do this, we'll create a dictionary with the *key* equal to the grouping variable and the *values* equal to a list containing the three measures (sum, sum of squares, count).

The following function will take in a data set, grouping variable, and summary variable. It creates an empty dictionary, and then it will iterate over a data frame utilizing a ```for loop```. It will iterate through each row of the data frame first looking to see if the grouping variable exists in the dictionary ```group_dict```. If it does, then it will add the variable for that row to the existing sum, square the variable for that row and add it to an existing sum, and add one more to the count measure. If the grouping variable does not exist, then it will create a *key* for the group variable and populate the *values* with the variable value, variable squared, and count of 1.

In [110]:
def map_vars(data, group, var):
  group_dict = {}
  for index, row in data.iterrows():
    if row[group] in group_dict:
      group_dict[row[group]] = [(group_dict[row[group]][0] + row[var]), (group_dict[row[group]][1] + (row[var]**2)), (group_dict[row[group]][2] + 1)]
    else:
      group_dict[row[group]] = [row[var], row[var]**2, 1]
  return group_dict

Let's run a quick test on one of our data sets to see that it works.

In [111]:
map_vars(nfl_sets[0], 'date', 'HQ4')

{'Sep': [391, 4129, 60],
 'Oct': [373, 4293, 56],
 'Nov': [300, 2582, 62],
 'Dec': [557, 6347, 78],
 'Jan': [95, 1389, 11]}

That's great! Our result is exactly what we want - a dictionary with the grouping variable as the *keys* and a list of *sum, sum of squares, and count* as the *values*.
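To see the three accumulators at work on numbers small enough to check by hand, here is a minimal sketch. The `rows` list and the `accumulate` helper are hypothetical stand-ins (plain dictionaries instead of data frame rows), not the function used in this notebook.

```python
# Hypothetical toy rows standing in for data frame rows.
rows = [
    {"date": "Sep", "HQ4": 3},
    {"date": "Sep", "HQ4": 7},
    {"date": "Oct", "HQ4": 10},
]

def accumulate(rows, group, var):
    """Return {group value: [sum, sum of squares, count]}."""
    totals = {}
    for row in rows:
        # Start a new [0, 0, 0] entry the first time a group is seen.
        acc = totals.setdefault(row[group], [0, 0, 0])
        acc[0] += row[var]
        acc[1] += row[var] ** 2
        acc[2] += 1
    return totals

toy_totals = accumulate(rows, "date", "HQ4")
print(toy_totals)  # {'Sep': [10, 58, 2], 'Oct': [10, 100, 1]}
```

For 'Sep' the sum is 3 + 7 = 10, the sum of squares is 9 + 49 = 58, and the count is 2, which matches the printed result.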

Now we want to ```map``` this function over each of our thirteen data sets. The ```map``` function takes a function as an argument as well as a list of data sets to apply it to. Because ```map_vars``` also needs the ```group``` and ```var``` arguments on every call, we will create lists of the grouping and summary variables, each repeated to the same length as our list of data sets.

In [112]:
arg2=['date']*13
arg3=['HQ4']*13

Since the ```map``` function returns a lazy map object, we'll use ```list()``` to create a list of our mapping results.

In [113]:
mapped = list(map(map_vars, nfl_sets, arg2, arg3))
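As a side note, repeating the extra arguments in lists is one of two common ways to pass fixed arguments through ```map```. The sketch below compares it with `functools.partial`; the `describe` function and the `chunks` list are hypothetical stand-ins used only to make the comparison visible.

```python
import functools

def describe(chunk, group, var):
    # Hypothetical stand-in for map_vars: just report what it received.
    return f"{chunk}:{group}:{var}"

chunks = ["c0", "c1", "c2"]

# Style used in the text: one argument list per parameter.
with_lists = list(map(describe, chunks, ["date"] * 3, ["HQ4"] * 3))

# Alternative: fix the repeated arguments once with functools.partial.
fixed = functools.partial(describe, group="date", var="HQ4")
with_partial = list(map(fixed, chunks))

print(with_lists == with_partial)
```

Both forms make the same three calls; `partial` simply avoids having to keep the helper lists the same length as the data.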


Let's access a few indices of our list to confirm that our mapping was successful.

In [114]:
mapped[0]

{'Sep': [391, 4129, 60],
 'Oct': [373, 4293, 56],
 'Nov': [300, 2582, 62],
 'Dec': [557, 6347, 78],
 'Jan': [95, 1389, 11]}

In [115]:
mapped[1]

{'Sep': [422, 4506, 60],
 'Oct': [291, 3029, 56],
 'Nov': [499, 5425, 75],
 'Dec': [389, 4227, 65],
 'Jan': [61, 569, 10],
 'Feb': [19, 361, 1]}

Great! It looks like our mapping function was a success. We now have a list of thirteen dictionaries, each with grouping results for each season or data 'chunk.'

Now we need to combine our separate dictionaries into one dictionary. Let's start by writing a function that combines two dictionaries into a single dictionary. Below, the function takes in two dictionaries and starts by creating an empty dictionary. We then iterate through the keys of the first dictionary using a ```for loop``` combined with ```if/else``` logic: if the *key* in the first dictionary is also in the second dictionary, then we add that *key* to the empty dictionary and combine (add) the *values* for that key; if the *key* is not in the second dictionary, then we simply copy the *key* and *values* of the first dictionary to the empty dictionary. Lastly, we check whether there are any *keys* in the second dictionary that were not in the first and copy those *keys* and *values* to the (previously) empty dictionary. We return the new, combined dictionary at the end of the function.

In [116]:
def reduce_vars(dict1, dict2):
  combined = {}
  for key in dict1:
    if key in dict2:
      combined[key] = [dict1[key][0]+dict2[key][0], dict1[key][1]+dict2[key][1], dict1[key][2]+dict2[key][2]]
    else:
      combined[key] = dict1[key]
  for key in dict2:
    if key not in dict1:
      combined[key] = dict2[key]
  return combined

Let's run this function on two of the dictionaries contained in the ```list``` **mapped**.

In [117]:
reduce_vars(mapped[0], mapped[1])

{'Sep': [813, 8635, 120],
 'Oct': [664, 7322, 112],
 'Nov': [799, 8007, 137],
 'Dec': [946, 10574, 143],
 'Jan': [156, 1958, 21],
 'Feb': [19, 361, 1]}

We can look back on the contents of ```mapped[0]``` and ```mapped[1]``` and see that they were successfully combined!

Now we will use ```reduce()``` from the ```functools``` library to apply our ```reduce_vars()``` function across all the dictionaries in the ```list``` **mapped**. ```reduce()``` will repeat the ```reduce_vars``` function, combining the first two dictionaries and then combining each successive dictionary into the 'running total' dictionary.

In [118]:
final = functools.reduce(reduce_vars, mapped)
final

{'Sep': [4456, 46934, 709],
 'Oct': [5204, 58742, 796],
 'Nov': [5307, 58377, 836],
 'Dec': [5513, 60091, 909],
 'Jan': [1378, 15314, 209],
 'Feb': [89, 1147, 12]}
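The 'running total' behaviour of ```reduce()``` is easiest to follow on a tiny example. This is a sketch under stated assumptions: `merge` is a hypothetical, compact variant of the combining function, and `toy_mapped` is invented data in the same [sum, sum of squares, count] layout.

```python
import functools

def merge(d1, d2):
    # Hypothetical, compact variant of the combining step:
    # element-wise addition of the value lists for each key.
    out = {key: list(vals) for key, vals in d1.items()}
    for key, vals in d2.items():
        if key in out:
            out[key] = [a + b for a, b in zip(out[key], vals)]
        else:
            out[key] = list(vals)
    return out

toy_mapped = [
    {"Sep": [10, 58, 2]},
    {"Sep": [4, 16, 1], "Oct": [7, 49, 1]},
    {"Oct": [3, 9, 1]},
]

# reduce computes merge(merge(toy_mapped[0], toy_mapped[1]), toy_mapped[2]).
toy_final = functools.reduce(merge, toy_mapped)
print(toy_final)  # {'Sep': [14, 74, 3], 'Oct': [10, 58, 2]}
```

Copying each list with `list(vals)` is a deliberate choice in this sketch: the combined dictionary never shares list objects with its inputs, so later changes to the result cannot alter the per-chunk dictionaries.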

Now let's transform these values into something more statistically interpretable: the **average** and **standard deviation**.

We can utilize a ```for loop``` to iterate through each *key* in our dictionary and create the **mean** and **standard deviation** from the values in the dictionary.

In [119]:
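def summation(dict_):
  # Convert [sum, sum of squares, count] per key into [mean, sample std].
  result = {}
  for key in dict_:
    total = dict_[key][0]  # named 'total' to avoid shadowing the built-in sum()
    sumsqrd = dict_[key][1]
    count = dict_[key][2]
    mean = total/count
    result[key] = [round(mean, 2), round((np.sqrt((sumsqrd - (count*(mean**2)))/(count-1))), 2)]
  return result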

In [120]:
summation(final)

{'Sep': [6.28, 5.17],
 'Oct': [6.54, 5.58],
 'Nov': [6.35, 5.44],
 'Dec': [6.06, 5.42],
 'Jan': [6.59, 5.47],
 'Feb': [7.42, 6.65]}

Looks great!
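Why do a sum, a sum of squares, and a count suffice? The sample variance can be written as (sum of squares - count * mean^2) / (count - 1), so no second pass over the data is needed. The sketch below checks that identity against Python's `statistics` module on a handful of made-up scores (the `values` list is an assumption for illustration).

```python
import math
import statistics

values = [3, 7, 0, 10, 6]  # hypothetical fourth-quarter scores

# The three accumulators kept per group in the text.
total = sum(values)
total_sq = sum(v ** 2 for v in values)
count = len(values)

mean = total / count
# Sample variance from the accumulators: (sum x^2 - n * mean^2) / (n - 1)
std_from_sums = math.sqrt((total_sq - count * mean ** 2) / (count - 1))

# Compare with the textbook two-pass sample standard deviation.
print(math.isclose(std_from_sums, statistics.stdev(values)))
```

One caveat worth keeping in mind: a group with a single observation has count - 1 equal to zero, so the formula is undefined there.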

Lastly, let's go ahead and write a wrapper function that will take in our list of data frames and *any* grouping variable and *any* summary variable and return the **mean** and **standard deviation** of the summary variable across the grouping variable.

In [121]:
def wrap_func(data, group, var):
  arg2 = [group]*len(data)
  arg3 = [var]*len(data)
  maps = list(map(map_vars, data, arg2, arg3))
  maps_reduce = functools.reduce(reduce_vars, maps)
  result = summation(maps_reduce)
  return result


Let's see how it does with a different grouping variable and summary variable.

In [122]:
wrap_func(nfl_sets, 'week', 'AFinal')

{'1': [19.72, 8.98],
 '2': [19.55, 10.43],
 '3': [20.77, 9.63],
 '4': [20.98, 10.06],
 '5': [20.53, 10.3],
 '6': [20.68, 10.27],
 '7': [21.11, 10.67],
 '8': [20.15, 9.78],
 '9': [22.15, 9.7],
 '10': [21.54, 10.02],
 '11': [19.59, 10.02],
 '12': [21.51, 10.8],
 '13': [20.44, 10.13],
 '14': [19.39, 10.2],
 '15': [20.94, 11.22],
 '16': [20.8, 10.33],
 '17': [19.24, 10.64],
 'WildCard': [21.33, 10.13],
 'Division': [21.0, 9.72],
 'ConfChamp': [20.88, 8.14],
 'SuperBowl': [27.62, 10.11]}

Fantastic!

# Part 3

Now let's try getting some summary stats from the same data set using 'Spark SQL' from ```pyspark```.

We'll get the same summary stats as above, **mean** and **standard deviation**, but instead of one summary variable, we'll get the stats for **ALL** the point variables ('AQ1', 'AQ2', 'AQ3', 'AQ4', 'AFinal', 'HQ1', 'HQ2', 'HQ3', 'HQ4', 'HFinal').

Let's start by setting up a ```SparkSession```.

In [123]:
from pyspark.sql.functions import *
from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local[*]').appName('my_app').getOrCreate()

Next, we'll create a *spark* data frame from our data set.

In [124]:
nfl_sp = spark.createDataFrame(nfl_data)

We can check that the data frame loaded correctly by inspecting the schema and columns.

In [125]:
# nfl_sp.printSchema() # to save space, I won't execute the cell; the data frame loaded correctly.
# nfl_sp.columns

The great thing about 'Spark SQL' is that we can use SQL-style operations directly on the data frame. Here, we'll select all of our summary variables and compute the mean and standard deviation through the ```.agg()``` function. I've added some additional code that rounds the results and adds concise column headers so the printed output is easier to read.

In [126]:
nfl_sp.select(['AQ1', 'AQ2', 'AQ3', 'AQ4', 'AFinal', 'HQ1', 'HQ2', 'HQ3', 'HQ4', 'HFinal']) \
.agg(round(avg('AQ1'),2).alias('meanA1'), round(std('AQ1'),2).alias('stdA1'), \
     round(avg('AQ2'),2).alias('meanA2'), round(std('AQ2'),2).alias('stdA2'), \
     round(avg('AQ3'),2).alias('meanA3'), round(std('AQ3'),2).alias('stdA3'), \
     round(avg('AQ4'),2).alias('meanA4'), round(std('AQ4'),2).alias('stdA4'), \
     round(avg('AFinal'),2).alias('meanAF'), round(std('AFinal'),2).alias('stdAF'), \
     round(avg('HQ1'),2).alias('meanH1'), round(std('HQ1'),2).alias('stdH1'), \
     round(avg('HQ2'),2).alias('meanH2'), round(std('HQ2'),2).alias('stdH2'), \
     round(avg('HQ3'),2).alias('meanH3'), round(std('HQ3'),2).alias('stdH3'), \
     round(avg('HQ4'),2).alias('meanH4'), round(std('HQ4'),2).alias('stdH4'), \
     round(avg('HFinal'),2).alias('meanHF'), round(std('HFinal'),2).alias('stdHF')) \
.show()



+------+-----+------+-----+------+-----+------+-----+------+-----+------+-----+------+-----+------+-----+------+-----+------+-----+
|meanA1|stdA1|meanA2|stdA2|meanA3|stdA3|meanA4|stdA4|meanAF|stdAF|meanH1|stdH1|meanH2|stdH2|meanH3|stdH3|meanH4|stdH4|meanHF|stdHF|
+------+-----+------+-----+------+-----+------+-----+------+-----+------+-----+------+-----+------+-----+------+-----+------+-----+
|  3.92| 4.49|  6.24| 5.22|  4.39| 4.63|  5.89| 5.28| 20.56| 10.2|  4.83| 4.73|  7.11|  5.7|  4.79| 4.76|  6.32| 5.42| 23.17|10.41|
+------+-----+------+-----+------+-----+------+-----+------+-----+------+-----+------+-----+------+-----+------+-----+------+-----+

And if we want to group these calculations, we can simply add a ```.groupBy()``` call before ```.agg()```.

In [127]:
nfl_sp.select(['AQ1', 'AQ2', 'AQ3', 'AQ4', 'AFinal', 'HQ1', 'HQ2', 'HQ3', 'HQ4', 'HFinal', 'season']) \
.groupBy('season') \
.agg(round(avg('AQ1'),2).alias('meanA1'), round(std('AQ1'),2).alias('stdA1'), \
     round(avg('AQ2'),2).alias('meanA2'), round(std('AQ2'),2).alias('stdA2'), \
     round(avg('AQ3'),2).alias('meanA3'), round(std('AQ3'),2).alias('stdA3'), \
     round(avg('AQ4'),2).alias('meanA4'), round(std('AQ4'),2).alias('stdA4'), \
     round(avg('AFinal'),2).alias('meanAF'), round(std('AFinal'),2).alias('stdAF'), \
     round(avg('HQ1'),2).alias('meanH1'), round(std('HQ1'),2).alias('stdH1'), \
     round(avg('HQ2'),2).alias('meanH2'), round(std('HQ2'),2).alias('stdH2'), \
     round(avg('HQ3'),2).alias('meanH3'), round(std('HQ3'),2).alias('stdH3'), \
     round(avg('HQ4'),2).alias('meanH4'), round(std('HQ4'),2).alias('stdH4'), \
     round(avg('HFinal'),2).alias('meanHF'), round(std('HFinal'),2).alias('stdHF')) \
.show()

+------+------+-----+------+-----+------+-----+------+-----+------+-----+------+-----+------+-----+------+-----+------+-----+------+-----+
|season|meanA1|stdA1|meanA2|stdA2|meanA3|stdA3|meanA4|stdA4|meanAF|stdAF|meanH1|stdH1|meanH2|stdH2|meanH3|stdH3|meanH4|stdH4|meanHF|stdHF|
+------+------+-----+------+-----+------+-----+------+-----+------+-----+------+-----+------+-----+------+-----+------+-----+------+-----+
|  2002|  4.04| 4.47|  6.02| 5.17|  4.45| 4.64|  5.93| 5.28| 20.64| 10.3|  4.31| 4.55|  7.41| 5.92|  4.72| 4.72|  6.43| 5.38| 23.02| 10.3|
|  2003|  3.56| 4.32|  6.09| 5.44|  3.91|  4.4|  5.36| 5.15| 19.11|10.21|  5.03| 4.68|  6.63| 5.37|  4.56| 4.72|   6.3| 5.32| 22.68|10.11|
|  2004|  3.91| 4.56|  6.27| 5.09|  4.15| 4.63|   5.8|  5.4| 20.24|10.16|  4.94| 4.73|  7.07| 5.59|  4.25| 4.68|  6.56| 5.16| 22.91|10.44|
|  2005|  3.89| 4.52|  5.54| 4.95|  4.03| 4.48|  5.19| 4.75| 18.79| 9.93|  4.39| 4.59|  7.43|  5.7|  4.68| 4.49|  5.74| 5.17| 22.31| 9.77|
|  2006|  3.63| 3.99|  6.18

And that's it!

# Part 4

Now let's perform the same calculations using 'pandas-on-spark' from ```pyspark```.

We'll start by importing the ```pandas``` API from ```pyspark``` and then loading our data set as a pandas-on-spark data frame.

In [128]:
import pyspark.pandas as ps

In [129]:
nfl_ps = ps.from_pandas(nfl_data)
# nfl_ps.head() # we can utilize a lot of the same pandas functions, including .head() to check that the data read in correctly (it did!).

Since we can utilize a lot of the same pandas functions, we can use ```.describe()``` over our selected columns to return the **mean** and **standard deviation**.  Since ```.describe()``` returns a number of data summaries, I subsetted the result to only return **mean** and **std**.

In [130]:
nfl_ps[['AQ1', 'AQ2', 'AQ3', 'AQ4', 'AFinal', 'HQ1', 'HQ2', 'HQ3', 'HQ4', 'HFinal']].describe()[1:3]

Unnamed: 0,AQ1,AQ2,AQ3,AQ4,AFinal,HQ1,HQ2,HQ3,HQ4,HFinal
mean,3.924806,6.241429,4.38692,5.890233,20.557188,4.828868,7.105157,4.791126,6.322962,23.174013
std,4.4907,5.221593,4.632717,5.278775,10.195586,4.726903,5.702788,4.755145,5.41731,10.405952


To get the same summaries grouped by season, we'll create two separate data frames - one giving us the **mean** of our variables grouped by season and the other giving us the **std** of our variables grouped by season. Then we simply join the two data frames with ```.join()```. We can distinguish the column names using 'lsuffix' and 'rsuffix.'

In [131]:
nfl_ps[['AQ1', 'AQ2', 'AQ3', 'AQ4', 'AFinal', 'HQ1', 'HQ2', 'HQ3', 'HQ4', 'HFinal', 'season']].groupby('season').mean() \
.join(nfl_ps[['AQ1', 'AQ2', 'AQ3', 'AQ4', 'AFinal', 'HQ1', 'HQ2', 'HQ3', 'HQ4', 'HFinal', 'season']].groupby('season').std(), rsuffix='std', lsuffix='avg')



season,AQ1avg,AQ2avg,AQ3avg,AQ4avg,AFinalavg,HQ1avg,HQ2avg,HQ3avg,HQ4avg,HFinalavg,AQ1std,AQ2std,AQ3std,AQ4std,AFinalstd,HQ1std,HQ2std,HQ3std,HQ4std,HFinalstd
2002,4.037453,6.022472,4.449438,5.928839,20.640449,4.307116,7.411985,4.715356,6.426966,23.018727,4.470297,5.16781,4.639673,5.282133,10.296996,4.54998,5.923282,4.716938,5.384231,10.295065
2003,3.558052,6.093633,3.913858,5.355805,19.11236,5.026217,6.625468,4.561798,6.29588,22.677903,4.320719,5.440954,4.396264,5.148618,10.211049,4.681919,5.367875,4.719871,5.321832,10.105887
2004,3.913858,6.265918,4.153558,5.797753,20.2397,4.94382,7.071161,4.250936,6.561798,22.906367,4.556669,5.094638,4.631822,5.400181,10.16271,4.733562,5.587193,4.675238,5.155379,10.441145
2005,3.88764,5.543071,4.029963,5.191011,18.786517,4.393258,7.426966,4.677903,5.737828,22.314607,4.522554,4.947945,4.476236,4.753811,9.926578,4.586556,5.70494,4.489042,5.173548,9.772905
2006,3.629213,6.179775,4.303371,6.06367,20.254682,4.606742,6.041199,4.565543,5.932584,21.258427,3.986015,5.157435,4.531605,5.539984,10.269065,4.999138,5.199966,4.931715,5.262574,9.876525
2007,3.696629,6.220974,4.337079,5.872659,20.228464,5.026217,7.074906,4.913858,6.018727,23.157303,4.324448,5.050111,4.507997,5.283565,10.572177,4.824287,5.780607,4.455721,6.0103,10.500518
2008,3.801498,6.498127,4.074906,6.423221,20.842697,5.179775,7.205993,4.595506,6.044944,23.183521,4.203686,5.337747,4.242863,5.498864,10.279806,4.85712,5.825685,4.70095,5.171299,10.414349
2009,3.868914,6.202247,4.299625,5.94382,20.382022,4.737828,7.790262,4.228464,5.88764,22.779026,4.748728,5.285488,4.714412,5.571192,10.7436,4.731771,6.208245,4.723587,5.17681,10.78811
2010,3.973783,6.865169,4.629213,5.70412,21.318352,4.576779,6.771536,4.868914,6.681648,23.0,4.727466,5.396722,4.784959,4.953022,10.278809,4.485221,5.476904,4.753476,5.626849,10.23006
2011,3.857678,5.981273,4.58427,5.996255,20.509363,5.022472,7.333333,5.033708,6.479401,23.981273,4.455541,5.281869,4.628635,5.120722,9.64769,4.859633,5.673886,4.962527,5.262521,10.527962


And that's it!
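For comparison, plain ```pandas``` can produce both statistics from a single ```.agg()``` call instead of a join. The sketch below uses an invented four-row frame (`toy` and its values are assumptions), and whether the identical call behaves the same way on a pandas-on-spark data frame is something to verify rather than take for granted.

```python
import pandas as pd

# Hypothetical toy data standing in for the selected point columns.
toy = pd.DataFrame({
    "season": [2002, 2002, 2003, 2003],
    "AQ1": [3, 7, 0, 10],
    "HQ1": [7, 7, 14, 0],
})

# One pass: mean and sample standard deviation for every column per season.
summary = toy.groupby("season").agg(["mean", "std"])

# Flatten the two-level column index into names such as "AQ1_mean".
summary.columns = [f"{col}_{stat}" for col, stat in summary.columns]
print(summary.round(2))
```

The flattened names play the same role as 'lsuffix' and 'rsuffix' in the join above: they keep the mean and std columns for each variable distinguishable.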

It looks like many of the built-in features of 'Spark SQL' and 'pandas-on-spark' work seamlessly on top of ```pyspark```'s RDDs (Resilient Distributed Datasets). Since data in Spark is stored in RDDs, it is spread out in 'chunks'; to perform any action on the data, the action has to be performed on each 'chunk' and the results then combined, much like what we did in **Part 2**. But with 'Spark SQL' and 'pandas-on-spark', we don't have to write ```map/reduce``` steps ourselves for every action. These APIs are built so that much of the ```map/reduce``` work happens in the background, and we can interact with Spark data using the same functionality we know from ```pandas``` and *SQL*.