# Homework 2b - Feature Extraction and Regression

In this part of the homework we'll look at the same dataset in a completely different light. Instead of simply analysing the data, we'll try to make inferences from it: predictions of when the dam's target value of 1.5 trillion rupees (the minimum estimate) will be reached. Use the same set-up as in part (a).

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

data = pd.read_pickle('./individual_contributions.pkl')
data.head()

| | Bank | Name | Amount | Date |
|---|---|---|---|---|
| 0 | AL BARAKA BANK (PAKISTAN) LTD | ADC 0117 | 25.0 | 2018-09-10 |
| 1 | AL BARAKA BANK (PAKISTAN) LTD | SARFARAZ 0117 | 100.0 | 2018-09-10 |
| 2 | AL BARAKA BANK (PAKISTAN) LTD | HAMNA ZEESHAN 0117 | 100.0 | 2018-09-10 |
| 3 | AL BARAKA BANK (PAKISTAN) LTD | ADC 0117 | 200.0 | 2018-09-10 |
| 4 | AL BARAKA BANK (PAKISTAN) LTD | NOMAN 0117 | 200.0 | 2018-09-10 |


We'll be running a regression analysis on this data since the target variable, the funds collected, is a continuous variable. Before we are able to run any sort of regression we need to decide what features we should be using for our regression. Moreover, since we are running a regression it is important to also figure out what exactly our target variable should be. Should it be the **cumulative sum** of the amount collected **till** each day, or should it simply be the amount collected **on** each day? Whatever you decide, write code below to get that target variable. 

Hint: Using groupby on "Date" would be a good option.
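
The two candidate targets in the hint above can be sketched on a toy table. The values below are made up for illustration and are not taken from the real dataset; the point is only to show the difference between the amount collected **on** each day and the running total **till** each day.

```python
import pandas as pd

# Toy stand-in for the contributions table (made-up values, not real data).
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2018-07-06", "2018-07-06", "2018-07-09", "2018-07-10"]),
    "Amount": [100.0, 50.0, 200.0, 25.0],
})

# Amount collected ON each day: one row per date.
daily = toy.groupby("Date")["Amount"].sum().reset_index()

# Amount collected TILL each day: running total of the daily sums.
daily["Cumulative"] = daily["Amount"].cumsum()

print(daily)
```

Selecting the `Amount` column before calling `sum()` keeps string columns such as `Bank` and `Name` out of the aggregation.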

In [14]:
## Code to calculate the target variable

# I can't train on a Year feature, because all the data collected is from the same year.

# Initially I ran my model on the cumulative sum of the amount collected till each day,
# but I did not go with that in the end, because predicting the cumulative amount
# without a 'Year' feature to train on does not work.
# Months, days of the week and days of the month all recur, so the model keeps
# predicting the same cumulative amounts each year.

# So I ended up predicting the daily contribution.

# Sum only the numeric 'Amount' column; 'Bank' and 'Name' are strings.
grouped_data = data.groupby('Date')['Amount'].sum().reset_index()

## Part B: Feature Extraction (20)

You currently have 3 columns other than the target variable (Amount): Bank, Name and Date. Which do you think should be used as the independent variable in running the regression?

Ans:

One possible variable we could use is the Date variable, but it cannot be used directly since it is a 'Datetime' object. Read up more on Linear Regression on the [sklearn Documentation page](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression) to learn what sort of independent variables it accepts.

There are many different ways you can extract the right features from just the datetime column. Some useful in-built functions include the sklearn library's [LabelEncoder](http://scikit-learn.org/dev/modules/generated/sklearn.preprocessing.LabelEncoder.html#sklearn.preprocessing.LabelEncoder), the [OneHotEncoder](http://scikit-learn.org/dev/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder) and the [OrdinalEncoder](http://scikit-learn.org/dev/modules/generated/sklearn.preprocessing.OrdinalEncoder.html#sklearn.preprocessing.OrdinalEncoder).
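
As a minimal sketch of one of the encoders mentioned above, the snippet below one-hot encodes the day of the week for a few toy dates. The dates and the variable names (`toy_dates`, `day_onehot`) are assumptions for illustration. Calling `.toarray()` on the default sparse output avoids relying on a version-specific constructor argument.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# A few toy dates (illustrative only).
toy_dates = pd.DataFrame(
    {"Date": pd.to_datetime(["2018-07-09", "2018-07-10", "2018-07-13", "2018-07-16"])}
)

# Integer day of the week: 0 is Monday, 6 is Sunday.
toy_dates["Day_int"] = toy_dates["Date"].dt.dayofweek

# One column per distinct weekday seen in the data, with a 1 in the matching column.
encoder = OneHotEncoder()
day_onehot = encoder.fit_transform(toy_dates[["Day_int"]]).toarray()

print(encoder.categories_[0])
print(day_onehot)
```

One-hot columns let a linear model learn a separate offset for each weekday, instead of forcing a single slope across the integer codes 0 to 6.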

You need to think deeply about what sort of variables can be extracted from simply "Date", and how they would be useful in trying to figure out how much money is collected on any given day. One good way to go about it would be to try out the regression on many different features and see which one works best.
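
Two further candidates that can be derived from "Date" alone are sketched below: the day of the year and the number of days elapsed since the first record. The helper name `extract_more_features` and the column names are hypothetical, and whether these features actually help is something to test rather than assume.

```python
import pandas as pd

def extract_more_features(frame):
    """Return a copy of `frame` with two extra date features (hypothetical helper)."""
    out = frame.copy()
    # Position within the calendar year (1-365, or 366 in a leap year); repeats every year.
    out["Day_of_year"] = out["Date"].dt.dayofyear
    # Days since the earliest date in the frame; keeps growing, so it never repeats.
    out["Days_elapsed"] = (out["Date"] - out["Date"].min()).dt.days
    return out

toy = pd.DataFrame({"Date": pd.to_datetime(["2018-07-06", "2018-07-09", "2019-07-06"])})
features = extract_more_features(toy)
print(features)
```

Unlike month, day of month and day of week, an elapsed-days feature takes a new value on every date, so it is one way to let a model express a trend over time.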

In [15]:
# Import the appropriate functions from sklearn #
# Extract the right features #
# An example of one feature that could be extracted is given below #
from sklearn.linear_model import LinearRegression

# banks = data['Bank'].unique()
# print(banks)

def extract_features(grouped_data):
    """Add integer calendar features derived from the 'Date' column, in place."""
    # Day of the week, where 0 is Monday and 6 is Sunday
    grouped_data['Day_int'] = grouped_data['Date'].dt.dayofweek
    # Day of the month
    grouped_data['Date_int'] = grouped_data['Date'].dt.day
    # Month of the year
    grouped_data['Month_int'] = grouped_data['Date'].dt.month
    # The year number should not provide any information, because all the data is
    # from the same year. Maybe use day of year instead?
    grouped_data['Year_int'] = grouped_data['Date'].dt.year

extract_features(grouped_data)

# months = grouped_data['Month_int'].unique()
# print(months)

# print(data)

X = grouped_data['Month_int'].to_frame()
y = grouped_data['Amount']

reg_TimeOfMonth = LinearRegression()
reg_TimeOfMonth.fit(X,y)

print(reg_TimeOfMonth.score(X, y))
# As expected, the cumulative amount has a very high correlation with month, and not so much with
# day of week or day of month, but combined they can provide a better model for prediction.

# Print the entire dataframe.head() with the extracted features at the end of this cell #
grouped_data.head()
grouped_data.sort_values(['Date'])

0.06803593218120563


| | Date | Amount | Day_int | Date_int | Month_int | Year_int |
|---|---|---|---|---|---|---|
| 0 | 2018-07-06 | 2402300.0 | 4 | 6 | 7 | 2018 |
| 1 | 2018-07-09 | 1346261.0 | 0 | 9 | 7 | 2018 |
| 2 | 2018-07-10 | 5374641.0 | 1 | 10 | 7 | 2018 |
| 3 | 2018-07-11 | 24830020.0 | 2 | 11 | 7 | 2018 |
| 4 | 2018-07-12 | 29174820.0 | 3 | 12 | 7 | 2018 |
| 5 | 2018-07-13 | 22234760.0 | 4 | 13 | 7 | 2018 |
| 6 | 2018-07-16 | 31087520.0 | 0 | 16 | 7 | 2018 |
| 7 | 2018-07-17 | 46139220.0 | 1 | 17 | 7 | 2018 |
| 8 | 2018-07-18 | 39317810.0 | 2 | 18 | 7 | 2018 |
| 9 | 2018-07-19 | 32733300.0 | 3 | 19 | 7 | 2018 |


## Part C: Regression and Evaluation (40)

From here onwards, how exactly you structure your code is up to you, and the main goal is this: You want to choose a regression model from one of the many [linear_models](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model) available on sklearn. If you're feeling adventurous you can try using [Support Vector Regression](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html) as well, but unless you take the time to understand how Support Vector Machines work, and why they might not be the best idea for such a dataset, it will not be a fruitful exercise.

You need to learn how to evaluate your model. Every sklearn regression model has a built-in function that can calculate the regression score for you (as done before), and the sklearn [Mean Squared Error](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html) function will be used to calculate the error on your test set and your train set. In most cases you will use either a custom function to split the dataset into a train-test set, or use the [train-test-split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function. Another extremely useful tool is [KFold cross-validation](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html). Read up on cross-validation and why it is such an effective way to evaluate your Machine Learning models.
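
The evaluation tools named above can be sketched together on synthetic data. The arrays `X_toy` and `y_toy` are generated here purely for illustration and have nothing to do with the donation data; the sketch only shows how the train/test MSE and the per-fold scores are obtained.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic regression problem: y depends linearly on two features plus noise.
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(60, 2))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(0, 0.5, size=60)

# Single hold-out split: fit on 80% of the rows, evaluate on the remaining 20%.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_tr, y_tr)
train_mse = mean_squared_error(y_tr, model.predict(X_tr))
test_mse = mean_squared_error(y_te, model.predict(X_te))

# 5-fold cross-validation: every row is used for testing exactly once.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(LinearRegression(), X_toy, y_toy, cv=kfold)  # R^2 per fold

print("train MSE:", train_mse)
print("test MSE:", test_mse)
print("R^2 per fold:", cv_scores)
print("mean R^2:", cv_scores.mean())
```

A single split gives one estimate that depends on which rows happen to land in the test set. Averaging over folds reduces that dependence, which matters most on small datasets.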

For the purpose of this assignment, the final values will be of the regression being **trained and tested on the entire dataset**. And the following results will be looked at (print these values clearly!):
1. Regression Score 
2. Mean Squared Error (Expect this to be really high, since the values of the data-set are also high)
3. The Regression Line that you get from the linear-models (either from the coefficients or from the predictions) over the data-points (I will upload a sample on Piazza)

**Lastly**, after you have trained your model, you need to build a mock data-set containing just the datetime objects. A good function to use is [pandas.date_range](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.date_range.html), which allows you to get a DatetimeIndex of whatever Date and Frequency you choose (the frequency is an extremely important parameter). You can convert that DatetimeIndex to a DataFrame and then extract the same features as you did in the previous part (making a function for feature extraction is a good idea). After that you need to print the exact **Month and Year that the 1.5 Trillion Rs target will be reached according to your regression.**
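
A minimal sketch of the mock data-set step follows. Business-day frequency (`freq="B"`) is an assumption made here for illustration; check which days actually appear in your own grouped data before choosing a frequency.

```python
import pandas as pd

# The frequency decides how many rows a date span produces.
every_day = pd.date_range(start="2018-07-06", end="2018-07-12", freq="D")
business_days = pd.date_range(start="2018-07-06", end="2018-07-12", freq="B")
print(len(every_day), len(business_days))

# Convert the DatetimeIndex to a DataFrame with a 'Date' column.
future = pd.date_range(start="2018-07-06", periods=5, freq="B").to_frame(
    index=False, name="Date"
)

# Extract the same kind of integer features as for the training data.
future["Day_int"] = future["Date"].dt.dayofweek  # 0 is Monday, 6 is Sunday
future["Date_int"] = future["Date"].dt.day
future["Month_int"] = future["Date"].dt.month

print(future)
```

With daily frequency the mock data-set contains weekends, so a model that predicts an amount for every row will accumulate funds on days for which the training data may contain no rows at all.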

**This is an iterative process and you will have to play around with the features, the model and parameters of the regression many times before you reach a good result**

In [20]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import ElasticNetCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# I scaled my data so my intercepts and Mean Squared Errors are not so large

X = grouped_data[['Day_int', 'Date_int', 'Month_int']]
y = grouped_data['Amount']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scalerX = StandardScaler().fit(X_train)
scalery = StandardScaler().fit(y_train.values.reshape((-1,1)))

X_train = scalerX.transform(X_train)
X_test = scalerX.transform(X_test)

y_train = scalery.transform(y_train.values.reshape(-1,1)).ravel()
y_test = scalery.transform(y_test.values.reshape((-1,1))).ravel()

regr = ElasticNetCV(l1_ratio = [.1, .5, .6, .7, 0.85, 0.9, .95, .99], cv = 3, n_alphas = 100, copy_X=True,
        max_iter=1000, random_state=0, tol=0.0001)

regr.fit(X_train, y_train)
y_pred = regr.predict(X_test)

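print("alpha: ", regr.alpha_)
# Caveat: X and y here are the unscaled full data, while regr was fit on
# standardized values, so this score is not on the model's own scale.
print("regression score: ", regr.score(X, y))

print("regression intercept: ", regr.intercept_)
print("regression coefficients: ", regr.coef_)
# Error on the standardized test targets, so it is in units of squared standard deviations
err = mean_squared_error(y_test, y_pred)
print("mean squared error: ", err)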

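# Giving an arbitrary range to the dates
dates = pd.date_range(start='2018-07-06', end='2040-07-06').to_frame()
dates.columns = ['Date']
extract_features(dates)

target = 1.5 * (10**12)

# Caveat: these features are passed in unscaled, although regr was fit on
# features transformed with scalerX
my_predictions_scaled = regr.predict(dates[['Day_int', 'Date_int', 'Month_int']])

# Scaling back the predictions; inverse_transform expects a 2D array
my_predictions = scalery.inverse_transform(my_predictions_scaled.reshape(-1, 1)).ravel()

# Accumulate the predicted daily amounts until the target is reached
total, i = 0, 0
while total < target:
    total = total + my_predictions[i]
    i = i + 1

# Note: i now points one row past the day on which the target was crossed
print(dates.iloc[i]['Date'])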

alpha:  0.2347943755240502
regression score:  -0.39308616394638607
regression intercept:  2.0817520167155173e-16
regression coefficients:  [-0.10952233  0.09404737  0.20757438]
mean squared error:  0.07621982895397654
2028-07-07 00:00:00
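
The negative regression score printed above is what one would expect when a model fitted on standardized values is scored on raw `X` and `y`. One way to keep the scaling consistent is to put the scaler and the regressor in a single pipeline, sketched below on synthetic data. The history, the `toy_target` value and the helper `add_features` are made up for illustration. `LinearRegression` stands in for `ElasticNetCV` to keep the example short, and the target is left unscaled, which is a simplification.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

FEATURES = ["Day_int", "Date_int", "Month_int"]

def add_features(frame):
    """Return a copy of `frame` with integer date features (hypothetical helper)."""
    out = frame.copy()
    out["Day_int"] = out["Date"].dt.dayofweek
    out["Date_int"] = out["Date"].dt.day
    out["Month_int"] = out["Date"].dt.month
    return out

# Synthetic daily history: 60 days with a weekday effect plus noise (not real data).
rng = np.random.default_rng(0)
history = add_features(
    pd.date_range("2018-07-06", periods=60, freq="D").to_frame(index=False, name="Date")
)
history["Amount"] = 1000.0 + 10.0 * history["Day_int"] + rng.normal(0, 5, size=len(history))

# The pipeline applies the same scaling in fit, score and predict.
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(history[FEATURES], history["Amount"])
print("regression score:", model.score(history[FEATURES], history["Amount"]))

# Forecast window starting the day after the history ends.
future = add_features(
    pd.date_range("2018-09-04", periods=365, freq="D").to_frame(index=False, name="Date")
)
daily_pred = model.predict(future[FEATURES])

# Running total: funds already collected plus the predicted daily amounts.
toy_target = 100_000.0
running_total = history["Amount"].sum() + np.cumsum(daily_pred)
reached = running_total >= toy_target

if reached.any():
    first = int(np.argmax(reached))  # index of the first day at or above the target
    print("target reached in:", future.loc[first, "Date"].strftime("%B %Y"))
else:
    print("target not reached within the forecast window")
```

Checking `reached.any()` before indexing avoids running off the end of the forecast window when the target is never reached, and `np.argmax` returns the crossing day itself rather than the day after it.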


#### What do you think the limitations of your regression were? What problems did you face in not being able to get a good fit?

The data set is **not large enough** to run a multiple regression properly, so it is easy to **overfit** the data.
Moreover, if data from several years, rather than several months, were available, then a "yearly trend" could perhaps also be integrated into the model, which would make sense in this context.

The **outliers**, such as the army donation, skew the data and were not dealt with properly.
On the one hand, these outliers cannot be discarded, because they constitute a large percentage of the total funds and are needed to estimate the donation pattern accurately and to estimate the time needed to reach the 1.5 trillion target of the dam project. On the other hand, the disproportionately large donations drown out the effect of the smaller donations.
Maybe with more data, a separate regression could be run on the pattern of the outlier donations, such as those from large industries, because in part (a) my calculations showed that the cumulative outlier contribution was more than the cumulative inlier contribution.
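
One way to separate the two groups before modelling them separately is sketched below. The 1.5 × IQR fence is an assumption made for this illustration and is not necessarily the rule used in part (a); the amounts are made up.

```python
import pandas as pd

# Made-up individual contributions with one very large donation.
amounts = pd.Series([25.0, 100.0, 100.0, 200.0, 200.0, 500.0, 1_000_000.0])

# Interquartile-range rule: anything above Q3 + 1.5 * IQR counts as an outlier.
q1, q3 = amounts.quantile(0.25), amounts.quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)
is_outlier = amounts > upper_fence

inlier_total = amounts[~is_outlier].sum()
outlier_total = amounts[is_outlier].sum()

print("upper fence:", upper_fence)
print("inlier total:", inlier_total)
print("outlier total:", outlier_total)
```

Each group could then be aggregated by date and fitted on its own, so that the large donations no longer dominate the fit for the small ones.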

Using a multiple linear regression implies that the donation pattern will remain constant with relation to my features, i.e. month (1-12), day of month (1-31) and day of week (0-6). This is not likely to be the case for donations, because donation patterns change with social hype, awareness, etc. We don't have a feature to capture that.