# Intro

## Data Details

### Columns
'DATE, REPORT_TYPE, SOURCE, AWND, CDSD, CLDD, DSNW, DYHF, DYTS, DailyAverageDryBulbTemperature, DailyAverageStationPressure, DailyAverageWindSpeed, DailyCoolingDegreeDays, DailyDepartureFromNormalAverageTemperature, DailyHeatingDegreeDays, DailyMaximumDryBulbTemperature, DailyMinimumDryBulbTemperature, DailyPeakWindDirection, DailyPeakWindSpeed, DailyPrecipitation, DailySnowDepth, DailySnowfall, DailySustainedWindDirection, DailySustainedWindSpeed, DailyWeather, HDSD, HTDD, HourlyAltimeterSetting, HourlyDewPointTemperature, HourlyDryBulbTemperature, HourlyPrecipitation, HourlyPresentWeatherType, HourlyPressureChange, HourlyPressureTendency, HourlyRelativeHumidity, HourlySeaLevelPressure, HourlySkyConditions, HourlyStationPressure, HourlyVisibility, HourlyWetBulbTemperature, HourlyWindDirection, HourlyWindGustSpeed, HourlyWindSpeed, MonthlyDaysWithGT001Precip, MonthlyDaysWithGT010Precip, MonthlyDaysWithGT32Temp, MonthlyDaysWithGT90Temp, MonthlyDaysWithLT0Temp, MonthlyDaysWithLT32Temp, MonthlyDepartureFromNormalAverageTemperature, MonthlyDepartureFromNormalCoolingDegreeDays, MonthlyDepartureFromNormalHeatingDegreeDays, MonthlyDepartureFromNormalMaximumTemperature, MonthlyDepartureFromNormalMinimumTemperature, MonthlyDepartureFromNormalPrecipitation, MonthlyGreatestPrecip, MonthlyGreatestPrecipDate, MonthlyMaxSeaLevelPressureValue, MonthlyMaxSeaLevelPressureValueDate, MonthlyMaxSeaLevelPressureValueTime, MonthlyMaximumTemperature, MonthlyMeanTemperature, MonthlyMinSeaLevelPressureValue, MonthlyMinSeaLevelPressureValueDate, MonthlyMinSeaLevelPressureValueTime, MonthlyMinimumTemperature, MonthlySeaLevelPressure, MonthlyStationPressure, MonthlyTotalLiquidPrecipitation, NormalsCoolingDegreeDay, NormalsHeatingDegreeDay, REM, REPORT_TYPE.1, SOURCE.1, ShortDurationEndDate005, ShortDurationEndDate010, ShortDurationEndDate015, ShortDurationEndDate020, ShortDurationEndDate030, ShortDurationEndDate045, ShortDurationEndDate060, ShortDurationEndDate080, ShortDurationEndDate100, 
ShortDurationEndDate120, ShortDurationEndDate150, ShortDurationEndDate180, ShortDurationPrecipitationValue005, ShortDurationPrecipitationValue010, ShortDurationPrecipitationValue015, ShortDurationPrecipitationValue020, ShortDurationPrecipitationValue030, ShortDurationPrecipitationValue045, ShortDurationPrecipitationValue060, ShortDurationPrecipitationValue080, ShortDurationPrecipitationValue100, ShortDurationPrecipitationValue120, ShortDurationPrecipitationValue150, ShortDurationPrecipitationValue180, Sunrise, Sunset, MonthlyInd'

## Analysis Code

In [135]:
import pandas as pd
import numpy as np
from plotnine import *

pd.set_option('display.max_columns', None)


In [136]:
def DropNaCols(df):
    '''given a dataframe, drop all the columns with nothing but NaN'''
    #dropna(how='all') removes only the columns that are entirely NaN;
    #counting unique values would also drop constant, non-NaN columns
    return df.dropna(axis=1, how='all')

In [137]:
def ReadingType(df):
    '''indicates whether the given row provides a monthly, daily, or hourly reading'''

    #work on a copy so the caller's dataframe is not modified in place
    df = df.copy()

    #a row counts as monthly or daily if its marker column is populated
    isMonthly = df['MonthlyMeanTemperature'].notna()
    isDaily = df['DailyMaximumDryBulbTemperature'].notna()

    #combine into a single reading column; monthly takes priority over daily,
    #and any row that is neither is labeled hourly
    df['ReadingType'] = np.select([isMonthly, isDaily], ['Monthly', 'Daily'], default='Hourly')
    return df

In [138]:
def SplitDataframes(df):
    '''given a dataframe of labeled monthly, daily, and hourly readings, split the dataframe by those labels into component dataframes and load those to a dictionary labeled according to reading type'''
    dfDict = {}
    for ele in df['ReadingType'].unique():
        dfDict[ele] = DropNaCols(df[df['ReadingType'] == ele]).reset_index(drop=True)
    return dfDict

In [139]:
#read in data
df = pd.read_csv('3063831.csv')

#clean up data
df = DropNaCols(df)

#add reading type column
df = ReadingType(df)

#next, put dataframe into separate dataframes depending on the type of reading
dfDict = SplitDataframes(df)



In [141]:
dfDict.keys()

dict_keys(['Hourly', 'Daily', 'Monthly'])

For every row, build an indicator column that says whether the reading is hourly, daily, monthly, or none of the above. First, check whether any of the monthly columns is not NaN; if so, the row is monthly. Otherwise, check whether any of the daily columns is not NaN; if so, the row is daily. Otherwise, check whether any of the hourly columns is not NaN; if so, the row is hourly. If none of these checks passes, the row should be deleted.
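The rule above can be sketched as follows. This is a minimal sketch under stated assumptions, not the notebook's final implementation: `label_reading_type`, the `groups` mapping, and the toy dataframe are hypothetical names and made-up data introduced only for illustration. The column groups are built with the same substring matching this notebook uses for its monthly, daily, and hourly column lists.

```python
import numpy as np
import pandas as pd


def label_reading_type(df, groups):
    """Label each row by the first column group holding any non-NaN value.

    `groups` maps a label to a list of column names, in priority order.
    Rows with no value in any group are dropped.
    """
    # One boolean Series per group: True where the row has any value in that group
    conditions = [df[cols].notna().any(axis=1) for cols in groups.values()]
    # np.select picks the label of the first condition that is True
    labels = np.select(conditions, list(groups.keys()), default="None")
    out = df.assign(ReadingType=labels)
    # Delete rows that matched none of the groups
    return out[out["ReadingType"] != "None"].reset_index(drop=True)


# Toy data (made up for illustration, not taken from the real CSV)
toy = pd.DataFrame({
    "MonthlyMeanTemperature": [np.nan, np.nan, 55.0, np.nan],
    "DailyPrecipitation": [np.nan, 0.1, 0.2, np.nan],
    "HourlyWindSpeed": [5.0, np.nan, np.nan, np.nan],
})

# Group columns by substring; dict order sets the priority (monthly first)
groups = {
    "Monthly": [c for c in toy.columns if "month" in c.lower()],
    "Daily": [c for c in toy.columns if "dai" in c.lower()],
    "Hourly": [c for c in toy.columns if "hour" in c.lower()],
}

labeled = label_reading_type(toy, groups)
print(labeled["ReadingType"].tolist())  # ['Hourly', 'Daily', 'Monthly']
```

The third toy row has both a monthly and a daily value and is labeled monthly because the monthly group is checked first; the fourth row has no values at all and is removed.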

In [None]:
list(df.iloc[0].index)

In [170]:
'MONTH'.lower()

'month'

In [179]:
monthlyCols = [ele for ele in list(df.columns) if 'month' in ele.lower()]
dailyCols = [ele for ele in list(df.columns) if 'dai' in ele.lower()]
hourlyCols = [ele for ele in list(df.columns) if 'hour' in ele.lower()]




In [186]:
boolMask = df[dailyCols].isna()
boolMask['sum'] = boolMask.sum(axis=1)

In [187]:
boolMask

,DailyAverageDryBulbTemperature,DailyAverageStationPressure,DailyAverageWindSpeed,DailyCoolingDegreeDays,DailyDepartureFromNormalAverageTemperature,DailyHeatingDegreeDays,DailyMaximumDryBulbTemperature,DailyMinimumDryBulbTemperature,DailyPeakWindDirection,DailyPeakWindSpeed,DailyPrecipitation,DailySnowDepth,DailySnowfall,DailySustainedWindDirection,DailySustainedWindSpeed,DailyWeather,sum
0,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,16
1,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,16
2,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,16
3,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,16
4,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,16
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
43448,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,16
43449,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,16
43450,False,False,False,False,False,False,False,False,False,False,False,True,True,False,False,False,2
43451,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,16


In [178]:
[ele for ele in list(df.columns) if 'month' in ele.lower()]

['MonthlyDaysWithGT001Precip',
 'MonthlyDaysWithGT010Precip',
 'MonthlyDaysWithGT32Temp',
 'MonthlyDaysWithGT90Temp',
 'MonthlyDaysWithLT0Temp',
 'MonthlyDaysWithLT32Temp',
 'MonthlyDepartureFromNormalAverageTemperature',
 'MonthlyDepartureFromNormalCoolingDegreeDays',
 'MonthlyDepartureFromNormalHeatingDegreeDays',
 'MonthlyDepartureFromNormalMaximumTemperature',
 'MonthlyDepartureFromNormalMinimumTemperature',
 'MonthlyDepartureFromNormalPrecipitation',
 'MonthlyGreatestPrecip',
 'MonthlyGreatestPrecipDate',
 'MonthlyMaxSeaLevelPressureValue',
 'MonthlyMaxSeaLevelPressureValueDate',
 'MonthlyMaxSeaLevelPressureValueTime',
 'MonthlyMaximumTemperature',
 'MonthlyMeanTemperature',
 'MonthlyMinSeaLevelPressureValue',
 'MonthlyMinSeaLevelPressureValueDate',
 'MonthlyMinSeaLevelPressureValueTime',
 'MonthlyMinimumTemperature',
 'MonthlySeaLevelPressure',
 'MonthlyStationPressure',
 'MonthlyTotalLiquidPrecipitation']