## Statistics Assignment

All of the following are based on data from your project. Each student must perform the analysis on their own; no collaboration between team members is allowed. The dataset to be analyzed by each team is listed below:

***
### ***For Part 1 and 2***

##### Library-Computer-Usage-Analysis
* Computer Utilization Data by Date-Time

##### Volag
* Flight Delay Data by Date-Time

##### Slipper-Streets
* Crash Data by Date-Time

##### Corpus
* Reviews of an electronic product (laptop) by Date-Time

##### SteamConnect
* Early Access Score by Release Date-Time

##### Toxic-Crusaders
* Chemical Industry Release (pick a particular industry) by Date-Time

##### Uni-X
* Repayment Rate for Female gender by Date-Time

##### WRF
* Migration count by Date-Time


### Part 1

* **Conduct Descriptive Analytics (Mean, Median, Quartile) calculations for each division of Date-Time (most probably a year or a 6-month duration; if your data covers a shorter span, use 1 month)**
* **Calculate divergence of mean and median in your data**
* **Visualize the data and draw inferences**
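
As a rough illustration of what Part 1 asks for, the sketch below computes the mean, median, and quartiles per month, plus the gap between mean and median. The `date` and `delay` column names and all values are made up for the example; they are not project data.

```python
import pandas as pd

# Hypothetical toy data: one measurement per row, each with a date.
toy = pd.DataFrame({
    "date": pd.to_datetime([
        "2015-01-05", "2015-01-12", "2015-01-20", "2015-01-28",
        "2015-02-03", "2015-02-10", "2015-02-17", "2015-02-24",
    ]),
    "delay": [2.0, 4.0, 6.0, 40.0, 1.0, 3.0, 5.0, 7.0],
})

# Group the measurements by calendar month.
by_month = toy.groupby(toy["date"].dt.to_period("M"))["delay"]

summary = pd.DataFrame({
    "mean": by_month.mean(),
    "median": by_month.median(),
    "q1": by_month.quantile(0.25),
    "q3": by_month.quantile(0.75),
})
# A large positive gap hints at a right-skewed month (a few big values).
summary["mean_minus_median"] = summary["mean"] - summary["median"]
print(summary)
```

In this toy data the single value of 40 in January pulls the mean well above the median, while February's values are symmetric, so its gap is zero.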

### Part 2
* **Conduct Probability distribution analysis based on the data. Analyze your data based on the type of distribution it best fits (for PDF and CDF)**
* **Conduct Method of Moments analysis on your data to suggest the best fit distribution. Visualize the results**
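
One possible shape for the Method of Moments step, assuming a gamma distribution is the candidate: match the sample mean and variance to the gamma's theoretical mean $k\theta$ and variance $k\theta^2$, which gives $\hat{k} = \bar{x}^2 / s^2$ and $\hat{\theta} = s^2 / \bar{x}$. The data here is synthetic, and the choice of gamma is an assumption for the sketch, not a claim about any project's data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic positive-valued data standing in for a project variable.
data = rng.gamma(shape=2.0, scale=3.0, size=50_000)

m = data.mean()  # first sample moment
v = data.var()   # second central sample moment

# Solve mean = k * theta and variance = k * theta**2 for k and theta.
k_hat = m**2 / v
theta_hat = v / m
print(k_hat, theta_hat)
```

The estimates can then be plugged into the candidate PDF and CDF and plotted against a histogram and the empirical CDF of the data to judge the fit.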

***
***

### ***For Part 3***
* Compare the variable with other variables in your project

### Part 3
* **Formulate a null hypothesis and evaluate it, perform correlation measures, and construct a linear regression model**
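
A minimal sketch of the three pieces of Part 3, using only NumPy and synthetic data. The variables `x` and `y` are hypothetical stand-ins for two project variables. The null hypothesis tested is "there is no linear correlation between `x` and `y`", for which $t = r\sqrt{(n-2)/(1-r^2)}$ follows a t-distribution with $n-2$ degrees of freedom.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical variables with a built-in linear relationship plus noise.
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)

# Pearson correlation coefficient.
r = np.corrcoef(x, y)[0, 1]

# Test statistic for H0: no linear correlation between x and y.
n = len(x)
t_stat = r * np.sqrt((n - 2) / (1 - r**2))

# Ordinary least-squares line y = slope * x + intercept.
slope, intercept = np.polyfit(x, y, 1)
print(r, t_stat, slope, intercept)
```

With 198 degrees of freedom the two-sided 5% critical value is about 1.97, so a `t_stat` far beyond that would lead to rejecting the null hypothesis.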

***

In [24]:
import pandas as pd
import os
import numpy as np
import random

In [4]:
# Get the data.
# It's large enough that it can impact memory significantly on my machine (8 GB total, would love 32 GB),
# so let's read it in as chunks and process things "lazily".
df = pd.read_csv(os.path.join('..','Data','flights_weather.csv'), chunksize=1000000,)

## Do some more data wrangling

Let's take all the columns in the dataframe that represent the date (`YEAR`, `MONTH`, `DAY`) and convert them to a single datetime column.

In [7]:
pd.options.display.max_columns = 99

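'''
with open(os.path.join('..','Data','flights_weather_1.csv'), 'w') as f:
    for i, chunk in enumerate(df):
        # pack YEAR, MONTH, DAY into an integer like 20150328 and parse it
        chunk['DATE'] = pd.to_datetime(
            chunk.YEAR*10000+chunk.MONTH*100+chunk.DAY,
            format='%Y%m%d'
        )
        # write chunk to new file; emit the header only with the first
        # chunk so it is not repeated in the middle of the data
        chunk.to_csv(f, header=(i == 0))
'''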
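
To make the date conversion concrete, here is a small sketch on two made-up rows. It packs `YEAR`, `MONTH`, and `DAY` into an integer such as `20150328` and parses it. The cast to `str` is my own addition to make the parse explicit. The second approach, passing year/month/day columns straight to `pd.to_datetime`, is an alternative that the notebook does not use.

```python
import pandas as pd

# Made-up rows standing in for one chunk of the flights data.
chunk = pd.DataFrame({"YEAR": [2015, 2015], "MONTH": [3, 12], "DAY": [28, 1]})

# Pack the three columns into one integer (e.g. 20150328), then parse it.
packed = chunk.YEAR * 10000 + chunk.MONTH * 100 + chunk.DAY
chunk["DATE"] = pd.to_datetime(packed.astype(str), format="%Y%m%d")

# Alternative: to_datetime can assemble dates from year/month/day columns.
alt = pd.to_datetime(chunk[["YEAR", "MONTH", "DAY"]].rename(columns=str.lower))
print(chunk["DATE"].tolist(), alt.tolist())
```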



In [25]:
# now we can read in the new data set which has a nice neat DateTime object column.
dtypes = {
    'ORIGIN_AIRPORT': 'str', 
    'DESTINATION_AIRPORT': 'str', 
    'IATA_CODE_x': 'str', 
    'origin_weather_station': 'str', 
    'IATA_CODE_y': 'str', 
    'destination_weather_station': 'str', 
    'OR_MAX': 'str', 
    'OR_MIN': 'str', 
    'OR_PRCP': 'str', 
    'DES_MAX': 'str', 
    'DES_MIN': 'str', 
    'DES_PRCP': 'str', 
    'OR_FRSHTT': 'str', 
    'DES_FRSHTT': 'str'
}
path = os.path.join('..','Data','flights_weather_1.csv')

### Our dataset is a little bit large

Since our dataset grew to about 2GB after merging it with the weather data, it is too large to load into my system's memory and process in one go. For that reason, we take a random sample of the dataset.

In [68]:
#lines = sum(1 for l in open(path))

with open(path) as f:
    lines = sum(1 for l in f)
    
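# sample ~1/30 (about 3.3%) of the dataset
sample_size = int(lines / 30)
# pick the line numbers to skip; starting at 1 always keeps the header (line 0)
skip = random.sample(range(1,lines), lines - sample_size)
# read only the lines that were not picked for skipping
sample = pd.read_csv(path, skiprows=skip, dtype=dtypes)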
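
The same skip-rows idea can be tried on a tiny in-memory CSV to check that the header survives and the sample has the expected size. The toy file and its `id`/`value` columns are made up. This sketch counts data rows separately from the header, so exactly `sample_size` rows are kept.

```python
import io
import random

import pandas as pd

random.seed(0)
# Toy CSV: one header line followed by 90 data rows.
csv_text = "id,value\n" + "".join(f"{i},{i * 2}\n" for i in range(90))

n_rows = csv_text.count("\n") - 1  # data rows, excluding the header
sample_size = n_rows // 30         # keep roughly 1 row in 30

# Line 0 is the header, so candidate lines to skip are 1..n_rows.
skip = random.sample(range(1, n_rows + 1), n_rows - sample_size)
sample = pd.read_csv(io.StringIO(csv_text), skiprows=skip)
print(sample)
```

Because the skipped lines are never parsed, the full file does not have to fit in memory as a dataframe, which is the point of sampling this way.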

In [69]:
sample = sample.loc[:, ~sample.columns.str.contains('^Unnamed')]
sample = sample.drop(['YEAR','MONTH','DAY'],axis=1)

In [46]:
sample.head()

Unnamed: 0,DAY_OF_WEEK,AIRLINE,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,DEPARTURE_TIME,DEPARTURE_DELAY,TAXI_OUT,WHEELS_OFF,SCHEDULED_TIME,ELAPSED_TIME,AIR_TIME,DISTANCE,WHEELS_ON,TAXI_IN,SCHEDULED_ARRIVAL,ARRIVAL_TIME,ARRIVAL_DELAY,DIVERTED,CANCELLED,CANCELLATION_REASON,AIR_SYSTEM_DELAY,SECURITY_DELAY,AIRLINE_DELAY,LATE_AIRCRAFT_DELAY,WEATHER_DELAY,DATE,IATA_CODE_x,origin_weather_station,IATA_CODE_y,destination_weather_station,OR_TEMP,OR_COUNT_TEMP,OR_DEWP,OR_COUNT_DEWP,OR_SLP,OR_COUNT_SLP,OR_STP,OR_COUNT_STP,OR_VISIB,OR_COUNT_VISIB,OR_WDSP,OR_COUNT_WDSP,OR_MXSPD,OR_GUST,OR_MAX,OR_MIN,OR_PRCP,OR_SNDP,OR_FRSHTT,DES_TEMP,DES_COUNT_TEMP,DES_DEWP,DES_COUNT_DEWP,DES_SLP,DES_COUNT_SLP,DES_STP,DES_COUNT_STP,DES_VISIB,DES_COUNT_VISIB,DES_WDSP,DES_COUNT_WDSP,DES_MXSPD,DES_GUST,DES_MAX,DES_MIN,DES_PRCP,DES_SNDP,DES_FRSHTT
0,6,WN,1671,N649SW,BWI,BDL,1010,1014.0,4.0,12.0,1026.0,70.0,67.0,49.0,283,1115.0,6.0,1120,1121.0,1.0,0,0,,,,,,,2015-03-28,BWI,724060-93721,BDL,725080-14740,35.4,24.0,17.0,24.0,1013.1,24.0,1007.3,24.0,9.9,24.0,11.6,24.0,18.1,26.0,50.0,30.0,0.01G,999.9,1000,36.0,24.0,28.2,24.0,1008.5,20.0,1001.7,24.0,6.3,24.0,7.1,24.0,13.0,19.0,45.0,30.0,0.01G,999.9,1000
1,6,EV,2509,N902EV,DFW,GGG,1010,1005.0,-5.0,19.0,1024.0,47.0,50.0,26.0,140,1050.0,5.0,1057,1055.0,-2.0,0,0,,,,,,,2015-03-28,DFW,722590-03927,GGG,722470-03901,59.6,24.0,42.1,24.0,1018.3,24.0,996.9,24.0,10.0,24.0,7.8,24.0,12.0,17.1,81.0,43.0,0.00G,999.9,0,58.3,24.0,41.3,24.0,1019.6,24.0,1006.0,24.0,10.0,24.0,5.0,24.0,11.1,19.0,81.0,37.9,0.00G,999.9,0
2,6,B6,160,N239JB,PHL,BOS,1015,1005.0,-10.0,29.0,1034.0,84.0,86.0,53.0,280,1127.0,4.0,1139,1131.0,-8.0,0,0,,,,,,,2015-03-28,PHL,724080-13739,BOS,725090-14739,39.2,24.0,19.3,24.0,1011.1,24.0,1010.1,24.0,10.0,24.0,13.4,24.0,20.0,25.1,48.0,33.1,0.20G,999.9,0,35.7,24.0,32.3,24.0,1007.3,17.0,1006.8,24.0,6.0,24.0,8.6,24.0,18.1,26.0,44.1,32.0,0.02G,999.9,111000
3,6,DL,2241,N357NB,MSP,MSY,1015,1012.0,-3.0,15.0,1027.0,160.0,132.0,113.0,1039,1220.0,4.0,1255,1224.0,-31.0,0,0,,,,,,,2015-03-28,MSP,726580-14922,MSY,722310-12916,30.0,24.0,10.6,24.0,1023.6,24.0,992.0,24.0,10.0,24.0,7.2,24.0,14.0,21.0,41.0,17.1,0.00G,999.9,0,59.8,24.0,39.8,24.0,1021.9,24.0,1021.0,24.0,10.0,24.0,6.6,24.0,15.0,999.9,70.0,50.0,0.00G,999.9,0
4,6,UA,1548,N14102,DEN,EWR,1015,1026.0,11.0,12.0,1038.0,218.0,209.0,189.0,1605,1547.0,8.0,1553,1555.0,2.0,0,0,,,,,,,2015-03-28,DEN,725650-03017,EWR,725020-14734,60.8,24.0,28.6,24.0,1012.9,24.0,833.8,24.0,10.0,24.0,9.4,24.0,21.0,27.0,79.0,39.0,0.00G,999.9,0,39.4,24.0,20.3,24.0,1009.6,24.0,1008.7,24.0,10.0,24.0,12.7,24.0,15.9,22.0,46.9,33.1,0.24G,999.9,0


In [58]:
sample.describe()

Unnamed: 0,DAY_OF_WEEK,FLIGHT_NUMBER,SCHEDULED_DEPARTURE,DEPARTURE_TIME,DEPARTURE_DELAY,TAXI_OUT,WHEELS_OFF,SCHEDULED_TIME,ELAPSED_TIME,AIR_TIME,DISTANCE,WHEELS_ON,TAXI_IN,SCHEDULED_ARRIVAL,ARRIVAL_TIME,ARRIVAL_DELAY,DIVERTED,CANCELLED,AIR_SYSTEM_DELAY,SECURITY_DELAY,AIRLINE_DELAY,LATE_AIRCRAFT_DELAY,WEATHER_DELAY,OR_TEMP,OR_COUNT_TEMP,OR_DEWP,OR_COUNT_DEWP,OR_SLP,OR_COUNT_SLP,OR_STP,OR_COUNT_STP,OR_VISIB,OR_COUNT_VISIB,OR_WDSP,OR_COUNT_WDSP,OR_MXSPD,OR_GUST,OR_SNDP,DES_TEMP,DES_COUNT_TEMP,DES_DEWP,DES_COUNT_DEWP,DES_SLP,DES_COUNT_SLP,DES_STP,DES_COUNT_STP,DES_VISIB,DES_COUNT_VISIB,DES_WDSP,DES_COUNT_WDSP,DES_MXSPD,DES_GUST,DES_SNDP
count,149166.0,149166.0,149166.0,147761.0,147761.0,147691.0,147691.0,149166.0,147253.0,147253.0,149166.0,147609.0,147609.0,149166.0,147609.0,147253.0,149166.0,149166.0,26047.0,26047.0,26047.0,26047.0,26047.0,132807.0,132807.0,132807.0,132807.0,132807.0,132807.0,132807.0,132807.0,132807.0,132807.0,132807.0,132807.0,132807.0,132807.0,132807.0,132807.0,132807.0,132807.0,132807.0,132807.0,132807.0,132807.0,132807.0,132807.0,132807.0,132807.0,132807.0,132807.0,132807.0,132807.0
mean,3.927021,2150.663241,1329.547504,1333.429504,8.933013,15.973404,1354.905824,141.902846,136.912579,113.527371,825.978688,1464.656694,7.424669,1488.545399,1469.506724,3.683497,0.002842,0.009982,13.641494,0.094406,19.488002,23.179752,2.659577,66.950794,23.986236,59.164462,23.967946,1069.159965,22.306836,1001.48021,23.916812,10.16727,23.965552,7.377336,23.97028,14.631955,406.602137,986.702036,66.946092,23.987335,57.590153,23.973036,1066.31755,22.313959,998.718184,23.922926,9.967661,23.970702,7.342599,23.97156,14.61137,404.146772,986.649377
std,1.996409,1741.915228,486.652215,499.576713,36.774818,8.658769,501.539585,75.09414,74.090184,72.132607,609.498848,526.629591,5.501084,511.758655,530.750984,38.828019,0.053239,0.099411,28.743652,3.409084,50.418909,42.992668,18.418395,14.553775,0.313312,253.520294,0.700697,692.636508,3.153783,375.792549,1.114673,27.059379,0.736547,20.253064,0.594209,24.844711,477.16953,113.950731,14.57384,0.300127,220.556377,0.628395,674.177179,3.126204,343.613912,1.047115,23.247181,0.656355,18.927757,0.568604,23.944168,476.629104,114.17522
min,1.0,1.0,1.0,1.0,-41.0,1.0,1.0,18.0,14.0,8.0,31.0,1.0,1.0,1.0,1.0,-68.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-15.3,6.0,-19.7,0.0,973.6,0.0,746.7,0.0,0.2,0.0,0.0,0.0,2.9,12.0,1.2,-20.6,6.0,-27.2,0.0,977.2,0.0,746.7,0.0,0.5,0.0,0.0,0.0,2.9,9.9,1.2
25%,2.0,727.0,915.0,918.0,-5.0,11.0,932.0,86.0,82.0,60.0,373.0,1050.0,4.0,1105.0,1054.0,-14.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,58.0,24.0,42.0,24.0,1012.1,22.0,983.1,24.0,9.5,24.0,4.8,24.0,11.1,20.0,999.9,57.9,24.0,42.1,24.0,1012.1,22.0,983.1,24.0,9.5,24.0,4.8,24.0,11.1,20.0,999.9
50%,4.0,1682.0,1325.0,1328.0,-2.0,14.0,1340.0,123.0,118.0,94.0,650.0,1503.0,6.0,1515.0,1507.0,-5.0,0.0,0.0,1.0,0.0,2.0,3.0,0.0,69.6,24.0,55.4,24.0,1015.2,24.0,997.9,24.0,10.0,24.0,6.6,24.0,13.0,28.0,999.9,69.6,24.0,55.3,24.0,1015.2,24.0,997.7,24.0,10.0,24.0,6.6,24.0,13.0,28.0,999.9
75%,6.0,3158.0,1734.0,1741.0,7.0,19.0,1755.0,173.0,168.0,144.0,1067.0,1909.0,9.0,1916.0,1914.0,7.0,0.0,0.0,17.0,0.0,19.0,29.0,0.0,77.8,24.0,65.6,24.0,1018.8,24.0,1012.4,24.0,10.0,24.0,8.7,24.0,17.1,999.9,999.9,77.9,24.0,65.7,24.0,1018.8,24.0,1012.4,24.0,10.0,24.0,8.7,24.0,17.1,999.9,999.9
max,7.0,7438.0,2359.0,2400.0,1544.0,160.0,2400.0,680.0,691.0,667.0,4983.0,2400.0,169.0,2359.0,2400.0,1528.0,1.0,1.0,674.0,440.0,1528.0,1057.0,712.0,99.8,24.0,9999.9,24.0,9999.9,24.0,9999.9,24.0,999.9,24.0,999.9,24.0,999.9,999.9,999.9,99.8,24.0,9999.9,24.0,9999.9,24.0,9999.9,24.0,999.9,24.0,999.9,24.0,999.9,999.9,999.9


We need to replace some values in our sample. A value of 999.9 (or 9999.9 in some columns, as the `max` row above shows) means that no value was recorded for that day. For precipitation we will replace non-recordings with 0; for other variables such as wind speed and temperature, it makes more sense to replace them with the mean of the values that were actually recorded.
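
A minimal sketch of the mean-replacement idea for the numeric columns, on made-up values. `OR_WDSP` and `OR_GUST` are real column names from the sample, but the numbers are invented. The sentinels are first turned into `NaN` so they do not distort the mean, then filled with each column's mean.

```python
import numpy as np
import pandas as pd

# Invented values; 999.9 marks "not recorded".
weather = pd.DataFrame({
    "OR_WDSP": [7.0, 999.9, 9.0],
    "OR_GUST": [20.0, 999.9, 999.9],
})

# Mark both sentinel values as missing before computing any statistics.
cleaned = weather.replace([999.9, 9999.9], np.nan)
# Fill each gap with the mean of that column's recorded values.
cleaned = cleaned.fillna(cleaned.mean())
print(cleaned)
```

Whether the mean is the right fill value is a judgment call. For a column like `OR_GUST`, where most entries are missing, leaving `NaN` in place may be the more honest choice.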

In [67]:
# remove the tag on the end of the precip values
# these could be useful in later analysis, but for
# now we simply want to take the values at face... value.
'''
sample['OR_PRCP'] = sample['OR_PRCP'].apply(lambda x: x[:-1])
sample['DES_PRCP'] = sample['DES_PRCP'].apply(lambda x: x[:-1])
'''

# the destination column is named DES_PRCP (see the dtypes dict above)
# note: replace() matches whole cell values only, so flagged strings
# such as '0.01G' are left unchanged
for col in ('OR_PRCP', 'DES_PRCP'):
    sample[col] = sample[col].replace('999.0','0')
sample['OR_PRCP'].head(1000)

0      0.01G
1      0.00G
2      0.20G
3      0.00G
4      0.00G
5      0.14G
6      0.01G
7      0.28G
8      0.00G
9      0.01G
10     0.06G
11     0.00G
12     0.00G
13     0.01G
14     0.02A
15     0.00G
16     0.01G
17     0.00G
18     0.27G
19     0.27G
20     0.00G
21     0.01G
22     0.00G
23     0.00G
24     0.00G
25     0.03G
26     0.00G
27     0.01G
28     0.00I
29     0.00G
       ...  
970    0.01G
971    0.00G
972    0.00G
973    0.00G
974    0.00G
975    0.00G
976    0.20G
977    0.00G
978    0.00G
979    0.00G
980    0.20G
981    0.02G
982    0.00G
983    0.20G
984    0.02G
985    0.00G
986    0.00G
987    0.00G
988    0.00G
989    0.00G
990    0.00G
991    0.00G
992    0.20G
993    0.00G
994    0.00G
995    0.00G
996    0.00G
997    0.00I
998    0.00G
999    0.00G
Name: OR_PRCP, Length: 1000, dtype: object
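
Since the values above still carry their trailing flag letter, one way to get numeric precipitation is to split each string into an amount and a flag. This is a sketch on a few strings copied from the output above. The `amount` and `flag` names are my own, and the pattern assumes every value looks like digits, a dot, digits, and at most one capital letter.

```python
import pandas as pd

# A few values as they appear in the OR_PRCP column.
prcp = pd.Series(["0.01G", "0.00G", "0.20G", "0.02A", "0.00I"], name="OR_PRCP")

# Capture the numeric part and the optional trailing flag separately.
parts = prcp.str.extract(r"^(?P<amount>\d+\.\d+)(?P<flag>[A-Z]?)$")
amount = parts["amount"].astype(float)
flag = parts["flag"]
print(amount.tolist(), flag.tolist())
```

Keeping the flag in its own column preserves it for later analysis while letting the amount be used in calculations.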