# Parse CSV Weather Data into Python

## Task 1

Write a Python program to parse the CSV files. Read the data from the CSV files into an appropriate Python data structure. For the file-handling part of this task you should NOT use an existing CSV reading and parsing library such as pandas. Instead, you should code the capability from scratch. You are recommended to structure your code to allow reuse of the core parser, as you may find it useful in the future. You may use libraries, including pandas, once you have read the data, including to convert data types from their raw input type.
Your code should correctly parse the CSV file provided. If a file is read that does not conform to the variant of the CSV file provided, your program should not generate any unhandled exceptions.
Compute the minimum, maximum, mean and standard deviation for each component of the weather data and report this. In Task 2 you may wish to account for gaps or other issues in the data provided, but in Task 1 you should compute these values for all the data provided.
Using either a suitable Python library to plot pairs of the data or Microsoft Excel, visually check for correlations between the weather components.

In [1]:
import numpy as np
# importing only numpy for the read csv task

In [2]:
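# Custom function that takes a filename, reads the file and returns the parsed rows as a list of lists

def my_csv_reader(filename):
    # Reject non-CSV files before trying to open them
    if not filename.lower().endswith('.csv'):
        print('The file is not a csv or is corrupted')
        return None
    try:
        with open(filename, 'r', encoding='utf-8-sig') as f:
            data_parsed = []
            for line in f:
                # Drop quote characters, then split on commas (commas inside quoted fields are not supported)
                elements = line.replace('"', '').strip().split(',')
                data_parsed.append(elements)
        return data_parsed
    except (OSError, UnicodeDecodeError):
        # Missing, unreadable or wrongly encoded file
        print('The file is not a csv or is corrupted')
        return None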
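
The reader above strips every double quote and then splits on commas, so a quoted field that itself contains a comma would be split into two columns. Below is a minimal sketch of a quote-aware line splitter, assuming the common convention that fields may be wrapped in double quotes and that a doubled quote (`""`) stands for a literal quote. `split_csv_line` is a hypothetical helper name, not part of the original notebook, and the sample line is made up for illustration.

```python
def split_csv_line(line, delimiter=',', quote='"'):
    """Split one CSV line into fields, honouring quoted fields (hypothetical helper)."""
    fields = []
    current = []
    in_quotes = False
    line = line.rstrip('\r\n')
    i = 0
    while i < len(line):
        ch = line[i]
        if in_quotes:
            if ch == quote:
                if i + 1 < len(line) and line[i + 1] == quote:
                    # A doubled quote inside a quoted field is a literal quote
                    current.append(quote)
                    i += 1
                else:
                    in_quotes = False
            else:
                current.append(ch)
        elif ch == quote:
            in_quotes = True
        elif ch == delimiter:
            # End of field: store it and start a new one
            fields.append(''.join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    fields.append(''.join(current))
    return fields


# Made-up sample line; the comma inside the quoted number is preserved
print(split_csv_line('"2015/10/20 00:00:00","1,013.2",7.5'))
# ['2015/10/20 00:00:00', '1,013.2', '7.5']
```

A splitter like this could replace the `replace(...).strip().split(',')` step inside `my_csv_reader` if the input files ever contain quoted commas; it still assumes one record per line.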


    

In [3]:
#Calling the custom csv reader function on the four data files given to us

barometer_data = np.array(my_csv_reader('barometer-last-year.csv'))
indoor_temp = np.array(my_csv_reader('indoor-temperature-last-year.csv'))
outdoor_temp = np.array(my_csv_reader('outside-temperature-last-year.csv'))
rainfall_data = np.array(my_csv_reader('rainfall-last-year.csv'))

The file is not a csv or is corrupted
The file is not a csv or is corrupted
The file is not a csv or is corrupted
The file is not a csv or is corrupted


In [4]:
# Only here am I importing pandas and the remaining dependencies required for the rest of the Tasks

import pandas as pd
from functools import reduce
import seaborn as sns
import matplotlib as plt
import matplotlib.pyplot  # load the pyplot submodule explicitly so plt.pyplot works in later cells

In [5]:
# Creating dataframes of the four csv files that have been read and parsed. We store them in four separate data frames for the time being

# the first entry in each parsed list contains the name, and the remaining contain the data. So the data has been read accordingly
df_barometer = pd.DataFrame(data=barometer_data[1:,:], columns=barometer_data[0,:]) 
df_indoor = pd.DataFrame(data=indoor_temp[1:,:], columns=indoor_temp[0,:])
df_outdoor = pd.DataFrame(data=outdoor_temp[1:,:], columns=outdoor_temp[0,:])
df_rainfall = pd.DataFrame(data=rainfall_data[1:,:], columns=rainfall_data[0,:])

IndexError: too many indices for array: array is 0-dimensional, but 2 were indexed
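
The `IndexError` above follows from the failed reads: `my_csv_reader` returned nothing usable, and `np.array(None)` is a 0-dimensional array, so slicing it with two indices fails. A minimal sketch of a guard that could run before building the data frames is shown below. `rows_to_table` is a hypothetical helper, and dropping rows whose field count differs from the header is an assumption about how malformed rows should be treated.

```python
def rows_to_table(rows):
    """Split parsed rows into (header, data), or return None if they are unusable."""
    # None, an empty list, or a header with no data rows cannot form a table
    if not rows or len(rows) < 2:
        return None
    header, data = rows[0], rows[1:]
    # Keep only rows whose field count matches the header (assumed policy)
    data = [row for row in data if len(row) == len(header)]
    if not data:
        return None
    return header, data


# A failed read is reported as None instead of raising later on
print(rows_to_table(None))

# Made-up rows: the malformed last row is dropped
table = rows_to_table([
    ['DateTime', 'Baro'],
    ['2015/10/20 00:00:00', '1010.5'],
    ['bad row'],
])
print(table)
```

With a check like this, each data frame would only be created when the corresponding file was actually parsed.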

In [None]:
# A sample of what one data frame looks like

df_indoor.head(5)

In [None]:
# Here, we merge all four data frames on the DateTime column. We also rename the temperature columns to something that makes more sense!

to_merge = [df_barometer, df_indoor, df_outdoor, df_rainfall]
df_merged = reduce(lambda l, r: pd.merge(l, r, on=['DateTime'], how='outer'), to_merge)

df_merged.rename(columns = {'Temperature_x':'Temperature indoor', 
                            'Temperature_range (low)_x':'Temperature_range (low) indoor',
                            'Temperature_range (high)_x':'Temperature_range (high) indoor',
                            'Temperature_y':'Temperature outdoor', 
                            'Temperature_range (low)_y':'Temperature_range (low) outdoor',
                            'Temperature_range (high)_y':'Temperature_range (high) outdoor'}, inplace=True)
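
To make the effect of `how='outer'` concrete, here is a minimal plain-Python sketch of an outer join on a timestamp key. It is an illustration of the idea, not how pandas implements it. `outer_join` is a hypothetical helper, the readings are made up, and the sketch assumes the two inputs have no column names in common (pandas handles that case with the `_x`/`_y` suffixes renamed above).

```python
def outer_join(left, right):
    """Outer-join two {key: {column: value}} mappings; absent values become None."""
    left_cols = sorted({col for row in left.values() for col in row})
    right_cols = sorted({col for row in right.values() for col in row})
    joined = {}
    # The union of keys keeps timestamps that appear in only one input
    for key in sorted(set(left) | set(right)):
        row = {col: left.get(key, {}).get(col) for col in left_cols}
        row.update({col: right.get(key, {}).get(col) for col in right_cols})
        joined[key] = row
    return joined


# Made-up readings keyed by timestamp
baro = {
    '2015/10/20 00:00:00': {'Baro': 1010.5},
    '2015/10/20 00:15:00': {'Baro': 1010.4},
}
rain = {
    '2015/10/20 00:15:00': {'mm': 0.0},
    '2015/10/20 00:30:00': {'mm': 0.2},
}
merged = outer_join(baro, rain)
for stamp, row in merged.items():
    print(stamp, row)
```

Timestamps present in only one source survive the join with gaps in the other columns, which is why the merged frame can contain missing values.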

In [None]:
# Previewing the merged dataframe. This becomes our main dataframe for the rest of this notebook

df_merged.head()

## Helper functions

These functions preprocess and clean up our main data frame by converting columns to appropriate data types, and compute the statistical summary for each column.

In [None]:

# Converting DateTime column data to a datetime object
def convert_date(df):
    df['DateTime'] = pd.to_datetime(df['DateTime'], format="%Y/%m/%d %H:%M:%S")
    return df

# Converting all columns with numerical data to float type
def convert_float(df):
    for col in df.columns[1:]:
        df[col] = pd.to_numeric(df[col], downcast='float')
    return df

# This function calculates the max, mean, min and standard deviation for each column, and stores it in a dictionary
def get_stats(df):
    results = dict()
    for col in df.columns[1:]:
        t = dict()
        t['Maximum'] = round(np.max(df[col]),2)
        t['Minimum'] = round(np.min(df[col]),2)
        t['Mean'] = round(np.mean(df[col]),2)
        t['Standard Deviation'] = round(np.std(df[col]),2)
        results[col] = t
    return results
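
As a cross-check on `get_stats`, the same four summaries can be sketched with the standard library alone. The sketch reports both the population and the sample standard deviation, since the two conventions give different numbers (NumPy's `np.std` defaults to `ddof=0`, the population form, on arrays). `column_stats` is a hypothetical helper, the input values are made up, and skipping missing values is an assumption.

```python
import math
import statistics


def column_stats(values):
    """Min, max, mean and standard deviation for a list of numbers, skipping gaps."""
    # Ignore None and NaN entries (assumed handling of missing readings)
    clean = [v for v in values if v is not None and not math.isnan(v)]
    if not clean:
        return None
    return {
        'Minimum': round(min(clean), 2),
        'Maximum': round(max(clean), 2),
        'Mean': round(statistics.fmean(clean), 2),
        # Population form divides by n, sample form divides by n - 1
        'Std (population)': round(statistics.pstdev(clean), 2),
        'Std (sample)': round(statistics.stdev(clean), 2) if len(clean) > 1 else None,
    }


stats_example = column_stats([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(stats_example)
```

For a year of frequent readings the two standard deviations are nearly identical; the distinction matters mainly for short series.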



In [None]:
# Viewing the cleaned dataframe

df_merged = convert_date(df_merged)
df_clean = convert_float(df_merged)
df_clean.head(5)

In [None]:
# As part of task 1, this is where we print out the required values for each weather component

stats = get_stats(df_clean)
print("    COMPONENT-WISE STATISTICS: \n=========================================================")
for s in stats:
    print("{}: Mean = {} || Maximum = {} || Minimum = {} || Standard Deviation = {}".format(s,stats[s]['Mean'],stats[s]['Maximum'],stats[s]['Minimum'],
                                                                                                                        stats[s]['Standard Deviation']))
    print("------")

## Visualizing the correlations between components

In [None]:
# Storing different colours in a list
colors = iter(['xkcd:red purple', 'xkcd:pale teal', 'xkcd:warm purple',
       'xkcd:light forest green', 'xkcd:blue with a hint of purple',
       'xkcd:light peach', 'xkcd:dusky purple', 'xkcd:pale mauve',
       'xkcd:bright sky blue', 'xkcd:baby poop green', 'xkcd:brownish',
       'xkcd:moss green', 'xkcd:deep blue', 'xkcd:melon',
       'xkcd:faded green', 'xkcd:cyan', 'xkcd:brown green',
       'xkcd:purple blue', 'xkcd:baby shit green', 'xkcd:greyish blue']*5) 

# Creating separate functions for scatter and histogram for our overall correlation plot
def my_scatter(x,y, **kwargs):
    kwargs['color'] = next(colors)
    plt.pyplot.scatter(x,y, **kwargs)

def my_hist(x, **kwargs):
    kwargs['color'] = next(colors)
    plt.pyplot.hist(x, **kwargs)

g = sns.PairGrid(df_clean)
g.map_diag(my_hist)
g.map_offdiag(my_scatter)




In [None]:
# Viewing the correlation as a heatmap as well
sns.heatmap(df_clean.corr(), annot=True)
plt.pyplot.show()
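
The heatmap shows the values returned by `DataFrame.corr()`, which uses the Pearson coefficient by default. For reference, a minimal sketch of that coefficient in plain Python follows. `pearson` is a hypothetical helper and the inputs are made up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError('need two sequences of equal length, at least 2 items')
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and the two deviation norms
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        # A constant series has no defined correlation
        return float('nan')
    return cov / (norm_x * norm_y)


print(pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))   # perfectly aligned
print(pearson([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))   # perfectly opposed
```

Values near +1 or -1 indicate a strong linear relationship between two components; values near 0 indicate no linear relationship, which does not rule out a non-linear one.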

## Task 2

Extend your code to summarise the data into monthly and seasonal values. From this summary data construct two separate infographics, one using the monthly values and one using the seasonal values. These should convey the summarised data for all the weather components concisely and clearly. Provide a brief report (one page) that justifies the visual mappings you've used.

In [None]:
# Creating monthly dataframe by first creating a new column, storing the month for each entry, and then grouping them according to the month

df_clean['month'] = pd.DatetimeIndex(df_clean['DateTime']).month
df_month = df_clean.groupby('month')
# df_1.aggregate({'Baro': np.mean})
df_month.describe()
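
The grouping step above can be pictured as bucketing readings by calendar month and then summarising each bucket. A minimal standard-library sketch follows, using the same timestamp format as `convert_date`. `monthly_mean` is a hypothetical helper and the readings are made up. Grouping by month number alone pools the same calendar month from different years, which may or may not be what is wanted for data that spans a year boundary.

```python
from collections import defaultdict
from datetime import datetime
import statistics


def monthly_mean(records):
    """Average (timestamp string, value) pairs per calendar month."""
    buckets = defaultdict(list)
    for stamp, value in records:
        month = datetime.strptime(stamp, "%Y/%m/%d %H:%M:%S").month
        buckets[month].append(value)
    # One rounded mean per month, in month order
    return {month: round(statistics.fmean(vals), 3)
            for month, vals in sorted(buckets.items())}


# Made-up barometer readings
sample = [
    ("2015/10/20 00:00:00", 1010.0),
    ("2015/10/21 00:00:00", 1012.0),
    ("2015/11/01 00:00:00", 1000.0),
]
print(monthly_mean(sample))
# {10: 1011.0, 11: 1000.0}
```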

In [6]:
# Viewing the month dataframe
df_month.head(10)

NameError: name 'df_month' is not defined

The next few cells contain various components of the monthly data, which will be used for the infographic. Feel free to skip below to the seasonal code

In [7]:
df_month[['Humidity']].mean()

NameError: name 'df_month' is not defined

In [8]:
df_month[['Humidity']].max()

NameError: name 'df_month' is not defined

In [9]:
df_month[['Humidity']].min()

NameError: name 'df_month' is not defined

In [10]:
df_month[['mm']].mean().round(decimals=3)

NameError: name 'df_month' is not defined

In [11]:
df_month[['mm']].max().round(decimals=3)

NameError: name 'df_month' is not defined

In [12]:
df_month[['mm']].min().round(decimals=3)

NameError: name 'df_month' is not defined

In [13]:
df_month[['Baro']].mean().round(decimals=3)

NameError: name 'df_month' is not defined

In [14]:
df_month['Temperature_range (high) outdoor'].max() - df_month['Temperature_range (low) outdoor'].min()

NameError: name 'df_month' is not defined

In [15]:
df_month['Temperature_range (low) indoor'].min()

NameError: name 'df_month' is not defined

In [16]:
df_month['Temperature_range (high) indoor'].max()

NameError: name 'df_month' is not defined

In [17]:
# Function to categorize each data point to a certain season based on the month it belongs to

def month_to_season(month):
    spring = list(range(3, 5+1))
    summer = list(range(6, 8+1))
    fall = list(range(9, 11+1))
    winter = [12, 1, 2]
    if month in spring:
        return 'Spring'
    elif month in summer:
        return 'Summer'
    elif month in fall:
        return 'Fall'
    else:
        return 'Winter'
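
Because of its final `else`, `month_to_season` returns 'Winter' for any input that is not in the spring, summer or fall lists, including invalid values such as 13. A minimal sketch of a stricter, dictionary-based variant is shown below. `SEASON_BY_MONTH` and `month_to_season_lookup` are hypothetical names, and returning `None` for invalid input is an assumption.

```python
# Lookup table using the same month-to-season mapping as month_to_season
SEASON_BY_MONTH = {
    12: 'Winter', 1: 'Winter', 2: 'Winter',
    3: 'Spring', 4: 'Spring', 5: 'Spring',
    6: 'Summer', 7: 'Summer', 8: 'Summer',
    9: 'Fall', 10: 'Fall', 11: 'Fall',
}


def month_to_season_lookup(month):
    """Return the season for a month number, or None if the month is invalid."""
    try:
        return SEASON_BY_MONTH[int(month)]
    except (KeyError, TypeError, ValueError):
        # Out-of-range, missing or NaN months are not silently labelled Winter
        return None


print(month_to_season_lookup(4))    # Spring
print(month_to_season_lookup(13))   # None
```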

In [18]:
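# Creating a season column and then using that column to group the data and create a seasons data frame.
# Fall rows also get the two-digit year appended (e.g. 'Fall15'), so Fall readings from different years are grouped separately.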

df_clean['season'] = df_clean['month'].apply(month_to_season)
df_clean['season'] = df_clean.apply(lambda x: x['season']+str(pd.Timestamp(x['DateTime']).year)[-2:] if x['season']=='Fall' else x['season'],axis=1)
df_season = df_clean.groupby('season')

df_season.describe()

NameError: name 'df_clean' is not defined
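
The one-line lambda in the cell above appends the two-digit year to 'Fall' only, presumably because the data covers parts of two different autumns (an assumption, since the notebook does not say). The same labelling can be sketched as a named function that is easier to test. `season_label` is a hypothetical helper and the dates are made up.

```python
from datetime import datetime


def season_label(stamp):
    """Season name for a datetime, with a two-digit year suffix for Fall only."""
    month = stamp.month
    if 3 <= month <= 5:
        season = 'Spring'
    elif 6 <= month <= 8:
        season = 'Summer'
    elif 9 <= month <= 11:
        season = 'Fall'
    else:
        season = 'Winter'
    if season == 'Fall':
        # Keep Fall readings from different years in separate groups
        return season + str(stamp.year)[-2:]
    return season


print(season_label(datetime(2015, 10, 20)))  # Fall15
print(season_label(datetime(2016, 9, 1)))    # Fall16
print(season_label(datetime(2016, 1, 5)))    # Winter
```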

The next few cells contain various components of the seasonal data, which will be used for the infographic. Feel free to skip them.

In [19]:
df_season[['Humidity']].mean()

NameError: name 'df_season' is not defined

In [20]:
df_season[['Humidity']].max()

NameError: name 'df_season' is not defined

In [21]:
df_season[['Humidity']].min()

NameError: name 'df_season' is not defined

In [22]:
df_season[['Baro']].mean().round(decimals=3)

NameError: name 'df_season' is not defined

In [23]:
df_season['Temperature_range (low) outdoor'].min()

NameError: name 'df_season' is not defined

In [24]:
df_season['Temperature_range (high) outdoor'].max()

NameError: name 'df_season' is not defined

In [25]:
df_season[['mm']].mean().round(decimals=3)

NameError: name 'df_season' is not defined