# Milestone 2: EDA 
--- 

## Loading in necessary imports:

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Reading, Cleaning, and Amalgamating My Dataset:

I wanted to group the cities that share a similar climate zone. I decided the most sensible way to sort the data was under these column headings:

- British Columbia 
- The Prairie Provinces
- Ontario 
- Quebec 
- The Atlantic Provinces 
- The Northern Territories

I used method chaining to write these functions and then added them to the scripts file in my group's analysis repository so everyone could apply the same processing.
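
As a minimal sketch of the chaining pattern (run on a tiny made-up frame, not the real CSV), a callable passed to `assign` receives the DataFrame as it exists at that step, so a column renamed earlier in the chain can be used straight away. The sample values and the names `raw` and `tidy` are assumptions for illustration only.

```python
import pandas as pd

# Made-up sample rows standing in for the real climate CSV (assumption).
raw = pd.DataFrame(
    {
        "LOCAL_DATE": ["1990-01-02", "1990-01-01", "1990-01-03"],
        "MEAN_TEMPERATURE_VANCOUVER": [4.5, None, 6.0],
    }
)

tidy = (
    raw
    .rename(columns={"LOCAL_DATE": "DATE"})
    # The lambda is called with the renamed frame, so "DATE" already exists here.
    .assign(DATE=lambda d: pd.to_datetime(d["DATE"]))
    .sort_values("DATE", ascending=False)
    .dropna()
    .reset_index(drop=True)
)

print(tidy)
```

Each step returns a new DataFrame, so the order of the steps matters: here the row with the missing temperature is dropped only after sorting, and the index is rebuilt last.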

In [2]:
def load_and_process(url_or_path_to_csv_file):
    # Method Chain 1 (load data, make it readable, and deal with missing data)
    df1 = (
        pd.read_csv(url_or_path_to_csv_file)
        .rename(columns={"LOCAL_DATE": "DATE"})
        # The lambda receives the DataFrame at this point in the chain,
        # so the renamed DATE column is available to it
        .assign(DATE=lambda d: pd.to_datetime(d["DATE"], yearfirst=True).dt.date)
        .sort_values("DATE", ascending=False)
        .dropna()
        .reset_index(drop=True)
    )
    return df1

In [5]:
df = load_and_process('Canadian_climate_history (1970-2020).csv')
df

In [None]:
#Method Chain 2 (Creating new columns for different regions of Canada)
def group_columns(df):
    df2 = (
        #create new columns that take the mean temperature and precipitation of the Atlantic provinces
        df
        .assign(TEMPERATURE_ATLANTIC = (df.iloc[:, [5, 7, 17]].mean(axis=1)).round(decimals=1))
        .assign(PERCIPITATION_ATLANTIC = (df.iloc[:, [6, 8, 18]].mean(axis=1)).round(decimals=1))

        #create new columns that take the mean temperature and precipitation of the Prairie provinces
        .assign(TEMPERATURE_PRAIRIES = (df.iloc[:, [1, 3, 15, 25]].mean(axis=1)).round(decimals=1))
        .assign(PERCIPITATION_PRAIRIES = (df.iloc[:, [2, 4, 16, 26]].mean(axis=1)).round(decimals=1))

        #create new columns that take the mean temperature and precipitation of the cities in Ontario and merge them into single columns
        .assign(TEMPERATURE_ONTARIO = (df.iloc[:, [11, 19]].mean(axis=1)).round(decimals=1))
        .assign(PERCIPITATION_ONTARIO = (df.iloc[:, [12, 20]].mean(axis=1)).round(decimals=1))

        #create new columns that take the mean temperature and precipitation of the cities in Quebec and merge them into single columns
        .assign(TEMPERATURE_QUEBEC = (df.iloc[:, [9, 13]].mean(axis=1)).round(decimals=1))
        .assign(PERCIPITATION_QUEBEC = (df.iloc[:, [10, 14]].mean(axis=1)).round(decimals=1))

        #dropping the columns that were amalgamated into the means
        .drop(columns = ['MEAN_TEMPERATURE_CALGARY', 'TOTAL_PRECIPITATION_CALGARY', 'MEAN_TEMPERATURE_EDMONTON', 'TOTAL_PRECIPITATION_EDMONTON', 
                         'MEAN_TEMPERATURE_HALIFAX', 'TOTAL_PRECIPITATION_HALIFAX', 'MEAN_TEMPERATURE_MONCTON', 'TOTAL_PRECIPITATION_MONCTON',
                        'MEAN_TEMPERATURE_SASKATOON', 'TOTAL_PRECIPITATION_SASKATOON', 'MEAN_TEMPERATURE_STJOHNS', 'TOTAL_PRECIPITATION_STJOHNS',
                        'MEAN_TEMPERATURE_WINNIPEG', 'TOTAL_PRECIPITATION_WINNIPEG', 'MEAN_TEMPERATURE_OTTAWA', 'TOTAL_PRECIPITATION_OTTAWA',
                        'MEAN_TEMPERATURE_TORONTO', 'TOTAL_PRECIPITATION_TORONTO', 'MEAN_TEMPERATURE_MONTREAL', 'TOTAL_PRECIPITATION_MONTREAL',
                        'MEAN_TEMPERATURE_QUEBEC', 'TOTAL_PRECIPITATION_QUEBEC'])
        
        #renaming columns to meet new location based naming
        .rename(columns={
            "MEAN_TEMPERATURE_VANCOUVER": "TEMPERATURE_BRITISH_COLUMBIA",
            "TOTAL_PRECIPITATION_VANCOUVER": "PRECIPITATION_BRITISH_COLUMBIA",
            "MEAN_TEMPERATURE_WHITEHORSE": "TEMPERATURE_NORTHERN",
            "TOTAL_PRECIPITATION_WHITEHORSE": "PRECIPITATION_NORTHERN",
        })
    )
        
    return df2

In [None]:
df2 = group_columns(df)

In [None]:
df2
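
Positional indices such as `[5, 7, 17]` would point at the wrong cities if the column order ever changed, so one alternative is to select the city columns by name. The sketch below is only an illustration: the `REGIONS` mapping and the `add_region_means` helper are hypothetical names, the city-to-region grouping mirrors the columns dropped in `group_columns`, and the sample values are made up.

```python
import pandas as pd

# Hypothetical mapping from region to the city suffixes used in the column names.
REGIONS = {
    "ATLANTIC": ["HALIFAX", "MONCTON", "STJOHNS"],
    "ONTARIO": ["OTTAWA", "TORONTO"],
}


def add_region_means(df, regions=REGIONS):
    """Return a copy of df with one mean-temperature column per region."""
    new_columns = {
        f"TEMPERATURE_{region}": df[[f"MEAN_TEMPERATURE_{city}" for city in cities]]
        .mean(axis=1)
        .round(1)
        for region, cities in regions.items()
    }
    return df.assign(**new_columns)


# Made-up values for two days (assumption, not real measurements).
sample = pd.DataFrame(
    {
        "MEAN_TEMPERATURE_HALIFAX": [1.0, 2.0],
        "MEAN_TEMPERATURE_MONCTON": [2.0, 4.0],
        "MEAN_TEMPERATURE_STJOHNS": [3.0, 6.0],
        "MEAN_TEMPERATURE_OTTAWA": [10.0, 20.0],
        "MEAN_TEMPERATURE_TORONTO": [20.0, 40.0],
    }
)

result = add_region_means(sample)
print(result[["TEMPERATURE_ATLANTIC", "TEMPERATURE_ONTARIO"]])
```

The same mapping could be extended to the precipitation columns and the remaining regions; it is shown here for two regions only to keep the example short.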

## Using pandas_profiling 
---
I wanted to use pandas profiling to get a basic overview of my dataset, including the correlations between its columns.

In [None]:
import pandas_profiling as pdp
dfprofile = pdp.ProfileReport(df2)

In [None]:
dfprofile

## sns.pairplot
---
I ran this function not knowing how long it would take. However, it produces a lot of useful graphs and I think it is a good starting point for my exploration. I can see some definite correlations. I commented the call out with the # symbol so it doesn't run automatically, because it takes too long.

In [None]:
#sns.pairplot(df)
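
One possible way to make the pair plot quicker is to plot only a few columns and a random sample of rows. The helper below is a hypothetical sketch (`subsample_for_pairplot`, its parameters, and the made-up `demo` frame are not part of the original analysis); it only prepares the smaller frame and leaves the plotting call to the reader.

```python
import numpy as np
import pandas as pd


def subsample_for_pairplot(df, columns, n=500, seed=0):
    """Return at most n random rows of the chosen columns."""
    subset = df[columns]
    return subset.sample(n=min(n, len(subset)), random_state=seed)


# Made-up data standing in for the climate frame (assumption).
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(2000, 4)), columns=["A", "B", "C", "D"])

small = subsample_for_pairplot(demo, ["A", "B"], n=500)
print(small.shape)  # (500, 2)
```

The smaller frame could then be passed to `sns.pairplot(small)`. A sample hides some detail, so it is better suited to a first look than to final figures.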

# Visualizations 
---
## Looking at the data types 
When I first tried to make graphs it was difficult because I wasn't aware that the date object had to be on the x-axis of the plot. I used .dtypes to check the type of every column and make sure I could plot everything in my dataset.

In [None]:
df2.dtypes
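
If `DATE` shows up as `object` in that listing (which is what `.dt.date` produces), one option is to convert it to pandas' own datetime type before plotting or grouping by month. This is a minimal sketch on made-up rows; `demo` is a hypothetical stand-in for `df2`.

```python
import datetime as dt

import pandas as pd

# Hypothetical stand-in for df2: DATE holds plain Python date objects.
demo = pd.DataFrame(
    {
        "DATE": [dt.date(2020, 1, 1), dt.date(2020, 2, 1)],
        "TEMPERATURE_ONTARIO": [-5.2, -3.1],
    }
)
print(demo.dtypes["DATE"])  # object

# Convert to a datetime64 column so the .dt accessors become available.
demo = demo.assign(DATE=lambda d: pd.to_datetime(d["DATE"]))
print(demo.dtypes["DATE"])
print(demo["DATE"].dt.month.tolist())  # [1, 2]
```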

## Making some displots 

To visualize the data, I made some displots to see whether there was an upward trend in temperatures across Canada over time.

#### British Columbia:

In [None]:
sns.displot(data = df2,
           y = "TEMPERATURE_BRITISH_COLUMBIA",
           x = "DATE")

#### Northern Territories:

In [None]:
sns.displot(data = df2,
           y = "TEMPERATURE_NORTHERN",
           x = "DATE")

#### Prairie Provinces: 

In [None]:
sns.displot(data = df2,
           y = "TEMPERATURE_PRAIRIES",
           x = "DATE")

#### Ontario: 

In [None]:
sns.displot(data = df2,
           y = "TEMPERATURE_ONTARIO",
           x = "DATE")

#### Quebec:

In [None]:
sns.displot(data = df2,
           y = "TEMPERATURE_QUEBEC",
           x = "DATE")

#### Atlantic Provinces:

In [None]:
sns.displot(data = df2,
           y = "TEMPERATURE_ATLANTIC",
           x = "DATE")

### My notes: 

After making these, it is clear that dropping null values during the initial cleaning removed a lot of the data points prior to 1985. The graphs I made are pretty busy and I'm not sure they are the best way to visualize the data.
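
One way to check how much each year loses is to count the rows per year before and after `dropna()`. This is a sketch on made-up rows, not output from the real dataset; `demo` and its values are assumptions.

```python
import pandas as pd

demo = pd.DataFrame(
    {
        "DATE": pd.to_datetime(
            ["1980-01-01", "1980-01-02", "1990-01-01", "1990-01-02"]
        ),
        "MEAN_TEMPERATURE_CALGARY": [-8.0, -6.5, -4.0, -2.5],
        # Pretend this station only started reporting later.
        "MEAN_TEMPERATURE_MONCTON": [None, None, -7.0, -5.5],
    }
)

cleaned = demo.dropna()

rows_before = demo.groupby(demo["DATE"].dt.year).size()
rows_after = cleaned.groupby(cleaned["DATE"].dt.year).size()

# Years that vanish entirely after dropna() show up as 0 in the "after" column.
summary = (
    pd.DataFrame({"before": rows_before, "after": rows_after})
    .fillna(0)
    .astype(int)
)
print(summary)
```

If only some columns matter for a given plot, `dropna(subset=[...])` would keep rows that are only missing values in the other columns.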

As the graphs stand, there is a lot of weight in the mid-range temperatures throughout most of each graph. This makes sense, as the current graphs are based on how many times a certain temperature is recorded throughout the year. I think a better approach might be to subdivide the dataset by month and monitor whether certain months have been changing over time. This should show change more clearly, and there will not be so many data points on the plots.
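
One possible shape for this monthly view is to average each month within each year and then look at a single month across years. The sketch below uses made-up daily values and assumes `DATE` is a datetime64 column; `demo`, `monthly`, and `january` are hypothetical names.

```python
import pandas as pd

# Made-up daily values (assumption); the real frame would be df2.
demo = pd.DataFrame(
    {
        "DATE": pd.to_datetime(
            ["2000-01-01", "2000-01-02", "2000-07-01", "2001-01-01", "2001-07-01"]
        ),
        "TEMPERATURE_ONTARIO": [-6.0, -4.0, 20.0, -3.0, 22.0],
    }
)

monthly = (
    demo
    .assign(YEAR=lambda d: d["DATE"].dt.year, MONTH=lambda d: d["DATE"].dt.month)
    .groupby(["YEAR", "MONTH"], as_index=False)["TEMPERATURE_ONTARIO"]
    .mean()
)

# One row per year for a single month, here January.
january = monthly[monthly["MONTH"] == 1]
print(january)
```

Each month then contributes one point per year, which should make a trend easier to read than the daily values.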

## Subdividing the dataframe: 
---