#### 1. Data Dictionary
<h4> Below is a data dictionary describing what each column in the dataframe contains. Each column will be examined to determine whether null-value handling, unit conversion, imputation, dropping, combining, or feature engineering is required to make the data as clean and accurate as possible.</h4>

<h4>1.City and date indicators</h4><ul>
   <li><b>city</b> – City abbreviations: sj for San Juan and iq for Iquitos</li>
   <li><b>week_start_date</b> – Date given in yyyy-mm-dd format</li>

<h4>2.NOAA's GHCN daily climate data weather station measurements</h4>
   <li><b>station_max_temp_c</b> – Maximum temperature</li>
   <li><b>station_min_temp_c</b> – Minimum temperature</li>
   <li><b>station_avg_temp_c</b> – Average temperature</li>
   <li><b>station_precip_mm</b> – Total precipitation</li>
   <li><b>station_diur_temp_rng_c</b> – Diurnal temperature range</li>

<h4>3.PERSIANN satellite precipitation measurements (0.25x0.25 degree scale)</h4>
  <li><b> precipitation_amt_mm</b> – Total precipitation</li>

<h4>4.NOAA's NCEP Climate Forecast System Reanalysis measurements (0.5x0.5 degree scale)</h4>
  <li><b> reanalysis_sat_precip_amt_mm</b> – Total precipitation</li>
  <li><b> reanalysis_dew_point_temp_k</b> – Mean dew point temperature</li>
  <li><b> reanalysis_air_temp_k</b> – Mean air temperature</li>
  <li><b> reanalysis_relative_humidity_percent</b> – Mean relative humidity</li>
  <li><b> reanalysis_specific_humidity_g_per_kg</b> – Mean specific humidity</li>
  <li><b> reanalysis_precip_amt_kg_per_m2</b> – Total precipitation</li>
  <li><b> reanalysis_max_air_temp_k</b> – Maximum air temperature</li>
  <li><b> reanalysis_min_air_temp_k</b> – Minimum air temperature</li>
  <li><b> reanalysis_avg_temp_k</b> – Average air temperature</li>
  <li><b> reanalysis_tdtr_k</b> – Diurnal temperature range</li>

<h4>5.Satellite vegetation: NOAA's CDR Normalized Difference Vegetation Index (NDVI) measurements (0.5x0.5 degree scale)</h4>
  <li><b>ndvi_se</b> – Pixel southeast of city centroid</li>
  <li><b>ndvi_sw</b> – Pixel southwest of city centroid</li>
  <li><b>ndvi_ne</b> – Pixel northeast of city centroid</li>
  <li><b>ndvi_nw</b> – Pixel northwest of city centroid</li></ul>

In [2]:
# These are the libraries used for this project.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt 
import statsmodels.api as sm
from matplotlib import pyplot
%matplotlib inline

from scipy.stats import mode
from datetime import datetime
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, cross_val_predict
from sklearn.svm import LinearSVR, SVR
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import make_scorer, mean_absolute_error
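from sklearn.preprocessing import LabelEncoder, StandardScaler, MinMaxScaler
# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22; SimpleImputer replaces it.
from sklearn.impute import SimpleImputer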
from statsmodels.tsa.seasonal import seasonal_decompose

#### 2. Data collection
<h5>The data for this project came from DrivenData.org as three .csv files: two for the training set (features and labels) and one for the test set. Below is a segment of the training features.</h5>

In [13]:
# Load the training features and labels into the notebook
dengue = pd.read_csv('dengue_features_train.csv')
dengue_labels = pd.read_csv('dengue_labels_train.csv')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [14]:
dengue.head(10)

Unnamed: 0,city,year,weekofyear,week_start_date,ndvi_ne,ndvi_nw,ndvi_se,ndvi_sw,precipitation_amt_mm,reanalysis_air_temp_k,reanalysis_avg_temp_k,reanalysis_dew_point_temp_k,reanalysis_max_air_temp_k,reanalysis_min_air_temp_k,reanalysis_precip_amt_kg_per_m2,reanalysis_relative_humidity_percent,reanalysis_sat_precip_amt_mm,reanalysis_specific_humidity_g_per_kg,reanalysis_tdtr_k,station_avg_temp_c,station_diur_temp_rng_c,station_max_temp_c,station_min_temp_c,station_precip_mm
0,sj,1990,18,4/30/1990,0.1226,0.103725,0.198483,0.177617,12.42,297.572857,297.742857,292.414286,299.8,295.9,32.0,73.365714,12.42,14.012857,2.628571,25.442857,6.9,29.4,20.0,16.0
1,sj,1990,19,5/7/1990,0.1699,0.142175,0.162357,0.155486,22.82,298.211429,298.442857,293.951429,300.9,296.4,17.94,77.368571,22.82,15.372857,2.371429,26.714286,6.371429,31.7,22.2,8.6
2,sj,1990,20,5/14/1990,0.03225,0.172967,0.1572,0.170843,34.54,298.781429,298.878571,295.434286,300.5,297.3,26.1,82.052857,34.54,16.848571,2.3,26.714286,6.485714,32.2,22.8,41.4
3,sj,1990,21,5/21/1990,0.128633,0.245067,0.227557,0.235886,15.36,298.987143,299.228571,295.31,301.4,297.0,13.9,80.337143,15.36,16.672857,2.428571,27.471429,6.771429,33.3,23.3,4.0
4,sj,1990,22,5/28/1990,0.1962,0.2622,0.2512,0.24734,7.52,299.518571,299.664286,295.821429,301.9,297.5,12.2,80.46,7.52,17.21,3.014286,28.942857,9.371429,35.0,23.9,5.8
5,sj,1990,23,6/4/1990,,0.17485,0.254314,0.181743,9.58,299.63,299.764286,295.851429,302.4,298.1,26.49,79.891429,9.58,17.212857,2.1,28.114286,6.942857,34.4,23.9,39.1
6,sj,1990,24,6/11/1990,0.1129,0.0928,0.205071,0.210271,3.48,299.207143,299.221429,295.865714,301.3,297.7,38.6,82.0,3.48,17.234286,2.042857,27.414286,6.771429,32.2,23.3,29.7
7,sj,1990,25,6/18/1990,0.0725,0.0725,0.151471,0.133029,151.12,299.591429,299.528571,296.531429,300.6,298.4,30.0,83.375714,151.12,17.977143,1.571429,28.371429,7.685714,33.9,22.8,21.1
8,sj,1990,26,6/25/1990,0.10245,0.146175,0.125571,0.1236,19.32,299.578571,299.557143,296.378571,302.1,297.7,37.51,82.768571,19.32,17.79,1.885714,28.328571,7.385714,33.9,22.8,21.1
9,sj,1990,27,7/2/1990,,0.12155,0.160683,0.202567,14.41,300.154286,300.278571,296.651429,302.3,298.7,28.4,81.281429,14.41,18.071429,2.014286,28.328571,6.514286,33.9,24.4,1.1


<h5> The two training sets were merged together. </h5>
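A merge along these lines would combine the features with the labels. This is only a minimal sketch: it assumes the labels file shares the `city`, `year`, and `weekofyear` keys with the features file and stores the target in a column named `total_cases`. The tiny frames below are made-up stand-ins for the real CSVs.

```python
import pandas as pd

# Made-up stand-ins for dengue_features_train.csv and dengue_labels_train.csv.
features = pd.DataFrame({
    "city": ["sj", "sj", "iq"],
    "year": [1990, 1990, 2000],
    "weekofyear": [18, 19, 26],
    "station_avg_temp_c": [25.4, 26.7, 26.4],
})
labels = pd.DataFrame({
    "city": ["sj", "sj", "iq"],
    "year": [1990, 1990, 2000],
    "weekofyear": [18, 19, 26],
    "total_cases": [4, 5, 0],  # assumed name of the target column
})

# Join on the shared keys; validate= raises if a key is duplicated on either side.
merged = features.merge(
    labels, on=["city", "year", "weekofyear"], how="left", validate="one_to_one"
)
```

Using `validate="one_to_one"` is a cheap guard against silently duplicating rows, which matters here because the submission must keep the original row count.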
<h5> Initially I planned to build a classification model and dummy-encoded the city column. After further analysis I decided against it: there are only two cities, and the city is not the target, so converting it to 1s and 0s would not help predict the total cases for each city, and there is more important data to look at. I decided instead to go with regression models and time series analysis.</h5>

#### 3. Feature engineering
<h5> I made a judgement call to convert all the measurements from metric to US customary units. I know that once the data is normalized through scaling it doesn't matter, but I had very specific reasons for doing these conversions. I wanted the data in a format that is more understandable to me: I'm going to be performing deep analytical and statistical analysis on it, and I want it in measurements that I can comprehend. I also wanted to show that I understand how to write functions and how to convert between different units of measurement.</h5>

In [15]:
# Example of code I wrote to convert Celsius to Fahrenheit
def celsius_to_fahren(temp_celsius):
    temp_fahren = (temp_celsius *1.8) + 32
    return temp_fahren

In [16]:
# convert mm to inches
def mm_to_inch(mm):
    inch = mm/25.4
    return inch

In [17]:
# converted Kelvin to Fahrenheit
def kelvin_to_fahren(temp_kelvin):
    temp_fahren = (temp_kelvin *1.8) - 459.67
    return temp_fahren  

In [18]:
# Change date from object to datetime
dengue['week_start_date']=pd.to_datetime(dengue.week_start_date)

In [19]:
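# kilogram per square meter to pound per square inch
# (for rainfall, 1 kg/m2 of water is a layer 1 mm deep, so mm_to_inch would give a depth in inches instead)
def kgm2_to_lbin2(kg_per_m2):
    lb_per_in2 = kg_per_m2 * 0.00142233
    return lb_per_in2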

In [20]:
# gram per kilogram to grains per pound
def gperkg_to_grperlb(g_per_kg):
    gr_per_lb = (g_per_kg * 7)
    return gr_per_lb
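The helpers above could be applied to whole columns roughly as follows. This is a sketch, not the notebook's actual cell: the two-row dataframe is made up, and the new column names (`_f`, `_inch` suffixes) are illustrative assumptions. Three of the helpers are repeated so the block runs on its own.

```python
import pandas as pd

def celsius_to_fahren(temp_celsius):
    return temp_celsius * 1.8 + 32

def kelvin_to_fahren(temp_kelvin):
    return temp_kelvin * 1.8 - 459.67

def mm_to_inch(mm):
    return mm / 25.4

# Two made-up rows standing in for the real dataframe.
df = pd.DataFrame({
    "station_max_temp_c": [29.4, 31.7],
    "reanalysis_air_temp_k": [297.57, 298.21],
    "precipitation_amt_mm": [12.42, 22.82],
})

# Map each source column to (converter, new column name); the new names are illustrative.
conversions = {
    "station_max_temp_c": (celsius_to_fahren, "station_max_temp_f"),
    "reanalysis_air_temp_k": (kelvin_to_fahren, "reanalysis_air_temp_f"),
    "precipitation_amt_mm": (mm_to_inch, "precipitation_amt_inch"),
}
for old_name, (convert, new_name) in conversions.items():
    # The helpers use only arithmetic, so they work on a whole column at once.
    df[new_name] = convert(df[old_name])
    df = df.drop(columns=old_name)
```

Renaming the columns along with the conversion keeps the unit visible in the name, so a Fahrenheit value is never stored under a `_c` or `_k` label.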

#### 4. Exploratory Data Analysis
<h5> There are a number of null values that need to be explored. Most of the columns are ints or floats, which is good. The week start date is an object, and I'm not sure yet whether that data type will have to be changed. The week of the year is an int, so there may be no need to, but I'll have to remember that this may cause an issue later on.</h5>
<h5> I performed a heatmap analysis on the data to better understand the correlations. Looking at the heatmap in general, there are some very strong correlations between reanalysis specific humidity and reanalysis dew point temperature, and between the reanalysis_sat_precip_amt and precipitation_amt columns, which leads me to believe there is either multicollinearity here or these columns hold duplicate values. These will be examined to determine whether either assumption is true. Other pairs have high correlations too, but I need to do more research on exactly what the data means before I can draw further conclusions. That will be fleshed out in the EDA and further analysis section if necessary.
The target for this project is total cases. It has no very strong correlation with any feature, which may work in my favor: too high a correlation can be due to multicollinearity or to features sharing a common underlying factor. At this point it's too early to start discounting features. What I do know, based on biology and epidemiology, is that mosquitoes grow best in warm, wet climates and can survive temperatures as low as 40 degrees Fahrenheit. So temperature, amount of rain, and in particular the amount of rain during certain seasons such as spring and summer should be examined closely.</h5>
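The correlation check described above can be sketched numerically as well as visually. The data below is synthetic, the 0.95 threshold is an arbitrary assumption, and `total_cases` is the assumed name of the target column. One precipitation column is built as an exact copy of the other to mimic the suspected duplicate.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
precip = rng.gamma(shape=2.0, scale=15.0, size=n)
df = pd.DataFrame({
    "precipitation_amt_mm": precip,
    "reanalysis_sat_precip_amt_mm": precip,  # exact copy, mimicking the suspected duplicate
    "reanalysis_air_temp_k": rng.normal(299, 1.2, n),
    "total_cases": rng.poisson(20, n),
})

corr = df.corr()

# List each pair of columns once and keep those with |r| above the threshold.
cols = list(corr.columns)
high_pairs = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if abs(corr.loc[a, b]) > 0.95
]

# Correlation of every feature with the target, ordered by absolute strength.
with_target = corr["total_cases"].drop("total_cases")
target_corr = with_target.reindex(with_target.abs().sort_values(ascending=False).index)
```

Passing `corr` to `sns.heatmap` gives the visual version. The list of flagged pairs is a quick way to decide which columns to compare value by value.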
<h5> There was one column that had 194 null values. I could impute or at least fillna, but I don't see the benefit. I think filling with 0 would be more detrimental to the data than leaving the gaps. The data is spread over multiple years and seasons, and I don't know what year or week each indexed row is. To impute the needed information accurately I would have to go by year and week and impute based on the seasonality. I think there comes a point where the amount of work needed to create data that, for some unknown reason, was never recorded becomes specious and a danger to the integrity of the entire dataset. For these reasons I will be dropping these 10 rows.
* Update * These rows can't be dropped. When I submitted my predictions to DrivenData.org, the submission was rejected for having an invalid number of rows. Since I have to do to my test set what I did to my train set, I can't drop the rows. Instead I will do a fillna on these rows.</h5>
<h5> There were also three columns (ndvi_nw, ndvi_se, ndvi_sw) that had to do with vegetation growth at different points of the region. A number of rows were missing and I had to figure out how to deal with the null values. There are two issues with imputing these columns. First, I wasn't happy that I couldn't give accurate accounts. The data for indices 229-242 is at least all in the same year and season, so the values I created shouldn't be too far off the actual ones. For the other years, and especially for nw, which had many more null values, I'm concerned the data will be off.
Second, if I had more time to learn and experiment with creating the data manually, season by season, I would have. But there comes a point in a project where a choice has to be made between spending an exorbitant amount of time going through hundreds of rows and working out the data manually, or imputing the values and spending more time building a strong, unbiased model.
I decided to spend the time working on the models and learning seasonal decompose to try to predict outbreaks accurately in other ways. I will keep these columns in mind when looking at the data, so I remember that these data points were created and not actually recorded.</h5>
<h5> Almost every column had null values. I chose to impute them instead of using fillna because I wanted to at least attempt a more accurate account based on the surrounding data, rather than just zeros, which I believe would unbalance the data and cause a bias. When dealing with biological and environmental factors, having realistic values helps produce a clearer overall picture of what is happening in that environment.</h5>
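The difference between zero-filling and imputing can be seen on the ten `ndvi_ne` values from the table shown earlier. This is a sketch under an assumption: the mean strategy is chosen here for illustration, and the notebook's actual imputation settings are not shown.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# ndvi_ne for the first ten training rows shown earlier; two readings are missing.
ndvi_ne = pd.DataFrame({"ndvi_ne": [0.1226, 0.1699, 0.03225, 0.128633, 0.1962,
                                    np.nan, 0.1129, 0.0725, 0.10245, np.nan]})

# Option A: fill with zero, which drags the column mean down.
zero_filled = ndvi_ne.fillna(0)

# Option B: replace each gap with the mean of the observed values (assumed strategy).
imputer = SimpleImputer(strategy="mean")
mean_filled = pd.DataFrame(imputer.fit_transform(ndvi_ne), columns=ndvi_ne.columns)
```

An imputer fitted on the training set can be reused on the test set with `imputer.transform`, so both sets receive the same treatment without dropping any rows.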
<h5>  This column is being dropped because it has the same values as reanalysis_sat_precip_amt_inch. I decided to drop this column rather than the other because it is the only column from a single-source dataset. The other column comes from a dataset with multiple other features that are all scaled the same way. The scaling is different for the column being dropped, so it makes sense to drop this column instead of the one from the group.</h5>
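Before dropping a suspected duplicate, it is worth confirming that the two columns really match. The sketch below uses made-up values, and `precipitation_amt_inch` is an assumed name for the dropped column, inferred from the description above.

```python
import pandas as pd

df = pd.DataFrame({
    "precipitation_amt_inch": [0.489, 0.898, 1.360, None],          # assumed column name
    "reanalysis_sat_precip_amt_inch": [0.489, 0.898, 1.360, None],
    "station_precip_inch": [0.630, 0.339, 1.630, 0.157],
})

a = df["precipitation_amt_inch"]
b = df["reanalysis_sat_precip_amt_inch"]

# NaN == NaN is False, so rows where both values are missing are counted as matches explicitly.
is_duplicate = bool(((a == b) | (a.isna() & b.isna())).all())

if is_duplicate:
    df = df.drop(columns="precipitation_amt_inch")
```

The explicit NaN handling matters because a plain `(a == b).all()` would report a mismatch whenever both columns are missing the same reading.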
<h5> I decided to drop this column because it's a reanalysis of a column already in the dataset, and that column makes more sense
than this one. They are scaled differently, and this one, once converted from Kelvin to Fahrenheit, didn't make sense anymore.
I was concerned it would do more harm than good to my dataset. Also, the range of temperature difference throughout the
day shouldn't make a difference unless it's a large one, as in a range of 50 or more degrees, and that rarely happens in nature.</h5>

#### 5. Preprocessing
<h5> I split the data into its individual cities and then did a train test split on the divided data. I chose a 50/50 split because the dataset is small: 936 rows for San Juan and 520 for Iquitos. Once each city's data is split into train and test, each part will be half that size. I wanted to keep the two parts equal. I normally do a 66/33 split, but I thought that would expose too much of the data to the model during training and leave very little for testing.</h5>
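The per-city 50/50 split could look like the following sketch. The frame is synthetic but uses the row counts quoted above. The `total_cases` column name and the `random_state` value are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic frame with the row counts quoted above: 936 for sj, 520 for iq.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "city": ["sj"] * 936 + ["iq"] * 520,
    "station_avg_temp_c": rng.normal(27, 1.5, 1456),
    "total_cases": rng.poisson(20, 1456),
})

splits = {}
for city, city_df in df.groupby("city"):
    X = city_df.drop(columns=["city", "total_cases"])
    y = city_df["total_cases"]
    # test_size=0.5 gives the 50/50 split; random_state is an arbitrary seed.
    splits[city] = train_test_split(X, y, test_size=0.5, random_state=42)

X_train_sj, X_test_sj, y_train_sj, y_test_sj = splits["sj"]
X_train_iq, X_test_iq, y_train_iq, y_test_iq = splits["iq"]
```

`train_test_split` shuffles rows by default. For weekly data, passing `shuffle=False` would keep the weeks in chronological order, which may be worth comparing.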
<h5> I reanalyzed the data after the split to look at the correlations before making a final decision on feature selection. I wanted to see whether the correlations had changed since the data transformation and the splitting of the dataset, and they had. This was something I had thought would happen. After spending so much time analyzing the data, I noticed a difference between the two cities with respect to environmental factors. These differences can be seen in the above heatmaps and correlation analysis. Iquitos has more negatively correlated features than San Juan. Normally I would not include these features when completing feature selection and splitting the data into X and y, but since the correlations differ for each city I have to keep everything. If I drop features that would aid one city over the other I would be biasing the outcome; then again, by not dropping the features I could also be biasing the outcome. If I could, I would divide the datasets, analyze the cities differently, and treat them as two different projects. That would be the best way to ensure that each set of data gets analyzed thoroughly, but I can't at this point in time. Due to that and the factor mentioned above, none of the features will be dropped because of negative correlation. This would make an interesting subject to revisit and turn into another project.</h5>
<h5>The baseline was calculated by determining the mean absolute error. This metric was chosen because it is the scoring metric used by the DrivenData.org competition.</h5>
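Mean absolute error is the average of |actual - predicted| over all weeks. The sketch below works it through for a naive baseline that predicts the training mean for every test week. That baseline is an assumption made for illustration, and the case counts are made up.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

y_train = np.array([4, 5, 4, 3, 6, 2, 4, 5])  # made-up weekly case counts
y_test = np.array([10, 6, 2, 4])

# Baseline: predict the training mean (4.125) for every test week.
baseline_pred = np.full(len(y_test), y_train.mean())

# |10-4.125| + |6-4.125| + |2-4.125| + |4-4.125| = 10.0, divided by 4 weeks = 2.5
baseline_mae = mean_absolute_error(y_test, baseline_pred)
```

A model is only useful if its MAE on the same test weeks comes in below this number.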

#### 6. Modeling
Gradient Boosting Regressor

Support Vector Machine Regressor

<h5> The San Juan set is larger than the Iquitos set, so the results differ greatly, but proportionally they seem equal. The San Juan set does better with Gradient Boosting Regression while the Iquitos set does better with the SVR, but not by much. I would probably go with the results of the Gradient Boosting model. I do think I need to play around with some of my feature selections. Going back and looking at which features correlate better with total cases might help. For Gradient Boosting Regression a higher n_estimators often helps, because boosting is fairly robust to overfitting, though the value should still be checked with cross validation. Max_depth depends more on the interaction of the input variables. I added a loss function and increased n_estimators, and this decreased the MAE value. This makes sense since 'GRB works by minimizing the loss of the model by adding weaker learners using a stage wise additive approach where one weaker learner is added at a time and existing weaker learners are frozen and left unchanged.'*1</h5>
<h5>Support Vector Machines are a supervised machine learning method that works well with small datasets and can be used for both classification and regression. The best hyperparameters to tune are the kernel, gamma and C. The kernel 'takes low dimensional input space and transform it to a higher dimensional space i.e. it converts not separable problem to separable problem.' There are multiple types of kernels: linear is good for a linear hyper-plane, while poly and rbf are good for a non-linear hyper-plane, which is more relevant in classification modeling. Gamma is the kernel coefficient for poly, rbf and sigmoid. The higher the value of gamma, the more the model tries to fit the training data exactly, which can result in overfitting and generalization error. C is the penalty parameter of the error term. A key point to remember when tuning is to cross validate to avoid overfitting. Cross validating helps ensure that the best parameters are chosen for the data being used.*2</h5>
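Tuning the kernel, gamma and C with cross validation might be sketched as below. The data is synthetic and the grid values are illustrative assumptions rather than the values used in this project. Scoring uses MAE to match the competition metric.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic regression data standing in for one city's features and case counts.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = 20 + 5 * X[:, 0] - 3 * X[:, 1] + rng.normal(scale=1.0, size=120)

# Scaling inside the pipeline keeps each CV fold from seeing the held-out fold's statistics.
pipe = make_pipeline(StandardScaler(), SVR())

# Illustrative grid; make_pipeline names the step "svr", hence the "svr__" prefix.
param_grid = {
    "svr__kernel": ["linear", "rbf"],
    "svr__C": [0.1, 1, 10],
    "svr__gamma": ["scale", 0.1],
}
search = GridSearchCV(pipe, param_grid, scoring="neg_mean_absolute_error", cv=5)
search.fit(X, y)

best_mae = -search.best_score_  # scores are negated MAE, so flip the sign
```

`search.best_params_` then reports which kernel, C and gamma had the lowest cross-validated MAE. Gamma is ignored by the linear kernel.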
<h5> Gradient Boosting Regression is good for predictive modeling. 'The general idea is to compute a sequence of (very) simple trees, where each successive tree is built for the prediction residuals of the preceding tree.' This model builds binary trees and runs the data through them. On each pass, the model partitions the data and determines the mean of each partition and thus the residual errors. The next tree is then fit to those residuals, finding another partition that further reduces them. This continues, and the fit to the target value becomes better and better. A major problem with this type of model, and with all machine learning for predictive modeling, is overfitting. To help guard against this, I set a random state so the runs are reproducible and tuned the n_estimators hyperparameter.*3</h5>
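The "each tree fits the residuals of the previous ones" idea can be shown with a hand-rolled loop. This is a teaching sketch for squared-error loss on synthetic data, not the `GradientBoostingRegressor` internals. The learning rate, tree depth and number of stages are arbitrary assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) * 10 + rng.normal(scale=1.0, size=200)

learning_rate = 0.1
prediction = np.full(len(y), y.mean())  # stage 0: a constant model
mse_per_stage = [np.mean((y - prediction) ** 2)]

for _ in range(50):
    residuals = y - prediction
    # Each shallow tree is fit to what the ensemble so far still gets wrong.
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)
    # Earlier trees stay frozen; only the new tree's scaled output is added.
    prediction += learning_rate * tree.predict(X)
    mse_per_stage.append(np.mean((y - prediction) ** 2))
```

The training error in `mse_per_stage` falls with every added tree, which is also why the number of stages should be judged on held-out data rather than on the training set.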

Time Series/Season Decompose Model

<h4> When examining seasonal decompose models we are looking for a number of things. The frequency should be the period over which the dataset is defined. In this case, the data is week to week. There are 52 weeks in a year, so the frequency is 52.</h4>
<h4>The observed panel shows the original values of the series. The first graph is the total cases for the entire dataset. There are definite points in time where the cases increase, which are signs of an outbreak, and there are times where the cases are at a low, which may be related to the seasons, i.e. to mosquito growth driven by weather conditions. The second graph is the change in cases for the entire dataset. Again there are spikes, and they match the spiked sections of the first graph. This is also true when you compare the graphs for each city: the spikes in total cases and change in cases line up. They should, since the change in cases is the week-to-week difference for the entire dataset. It shows that during an outbreak the number of reported cases increases from week to week; then, once the outbreak is over, there is a sharp decrease in reported cases and the series levels out again before the next increase, i.e. the next outbreak.</h4>
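The week-to-week change can be computed with a one-line `diff`. In this sketch the counts are made up, `change_in_cases` is an assumed column name, and grouping by city is an assumption added so that the first Iquitos week is not compared with the last San Juan week.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["sj", "sj", "sj", "sj", "iq", "iq", "iq"],
    "total_cases": [4, 5, 12, 6, 0, 3, 1],  # made-up weekly counts
})

# Difference from the previous week within each city; each city's first week has no predecessor.
df["change_in_cases"] = df.groupby("city")["total_cases"].diff()
```

The resulting column is positive while cases climb and negative once they fall, which is why it swings around zero instead of staying above it like `total_cases`.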
<h4>The trend shows the increasing or decreasing value in the series. In the above models the trend follows a similar pattern for all of the graphs. The trend should follow the observed values, showing when there are increases or decreases throughout the time span.</h4>
<h4> Seasonal is the repeating short-term cycle in the series. I think of it as the heartbeat of the graph. Some data has more noise and variance and thus a stronger beat, while other data is less drastic and easier to read. Whenever the datasets are set to the change in cases column, the seasonality is stronger, more rapid, and more chaotic. This is because the change is viewed differently than in total cases. In change in cases, the data isn't static; it is moving. It is negative or positive from week to week, which causes the graph to show this variance.</h4>
<h4> The last part is the residual, which shows the random variance in the series. It represents the remainder of the time series after the other components have been removed.*4</h4>


#### 7. Final Observations
<h4> This project posed some interesting challenges: determining the best way to deal with the null values, deciding whether to keep the measurements in metric or convert them, understanding the different cities and their conflicting datasets, and visualizing the data to show predicted outbreaks.</h4>
<h4> Iquitos seems to be predicted better than San Juan, according to the visualizations and the test set data. This could be due to a number of reasons: San Juan has a larger dataset and its week-to-week variance in cases is erratic, while Iquitos has a smaller dataset and a more consistent variance.</h4>
<h4>The cities have very different needs, both environmentally and analytically. San Juan works better with the GBR model while Iquitos works better with the SVM model. The feature importance graph shows how differently the datasets weight the features: some features have a coefficient of 0 for one city and a higher one for the other. I believe my results would have been better if I could have treated the dataset as two different projects, analyzing and modeling the two cities separately. Putting the data aside and looking at the cities environmentally, they are very different from each other. San Juan is on an island, completely surrounded by water, and is affected by hurricanes, flooding, and other natural disasters. It is also a major city surrounded by other cities, with areas of green belt. Iquitos is in the Peruvian Amazon. It is a major city with the Amazon River on one side and the Amazon rainforest on the other. Yet San Juan has more cases of dengue than Iquitos. One has to wonder whether this is because cases are reported more in San Juan than in Iquitos, or whether the environmental factors make the mosquitoes that carry dengue fever more prevalent in San Juan than in Iquitos. More research needs to be done on these outside factors.</h4>
<h4> Reviewing the models I used, I would have liked, with more time, to try other algorithms. Principal Component Analysis, for example, is an unsupervised technique that is good for detecting correlations between variables and reducing them to fewer components. Another option is XGBoost, which is a form of gradient boosting designed for speed and performance.</h4>
<h4> Another angle I could take would be to spend more time on feature selection and hyperparameters. There are more hyperparameters I could try with the models I have already used. Feature selection has always been an issue with this project. As stated earlier, each city has different correlations between the features and the target. I could spend more time looking at the heatmaps and determining whether dropping more features would really be detrimental or not.</h4> 
<h4> I'm wondering now whether converting the metric measurements to US customary units was really needed and whether it had an effect on my results. If I were to redo the feature engineering I would probably skip this step, though it did give me an opportunity to work on my coding skills.</h4>
<h4> Another step I would like to work on is forecasting, either by trying Prophet or by working with the time series model I already have experience with.</h4>

References
1. https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/
2. https://www.analyticsvidhya.com/blog/2017/09/understaing-support-vector-machine-example-code/
3. http://www.statsoft.com/Textbook/Boosting-Trees-Regression-Classification
4. https://machinelearningmastery.com/decompose-time-series-data-trend-seasonality/