# Regression - Exercise

```In this exercise you will experiment with simple features of linear and logistic regression, and get to know some interesting properties of those regressions.```

~```Ittai Haran```

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
import seaborn as sns

## Coefficients and some more

```We will start by exploring several concepts regarding linear and logistic regression. To that end, we will work with a generated dataset. Start by generating (using numpy and np.random) the following dataset:```

$X \sim \cal N(0,1)^3$ ```(i.e., X consists of 3-dimensional vectors)```

$Y = 0.3\cdot X[:,0] + 0.5\cdot X[:,1] - 0.7\cdot X[:,2] + 1$

```Generate 1,000 samples.```

In [None]:
X = np.random.normal(0, 1, size=(1000, 3))  # standard normal, as specified above

In [None]:
Y = 0.3*X[:,0] + 0.5*X[:,1] - 0.7*X[:,2] + 1

```Train a simple linear regression (sklearn.linear_model.LinearRegression) on the data. What coefficients did you get using your regression? Did you expect those coefficients?```

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
clf = LinearRegression()
clf.fit(X,Y)

In [None]:
f, axes = plt.subplots(1, 3 , figsize=(20,5))
for i in range(3):
    sns.scatterplot(x=X[:,i], y=Y, ax=axes[i])  # keyword arguments are required by recent seaborn versions
plt.show()

In [None]:
clf.coef_ # the coefficients are as expected given the plots above: each one matches the direction of the trend in its panel
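
As a cross-check on the question above, here is a minimal sketch that generates the dataset exactly as specified and compares the fitted parameters with the true ones. The seed and the names `X_demo`, `Y_demo`, `true_coef` and `true_intercept` are assumptions made for illustration. Because the target is a noise-free linear function of the features, ordinary least squares is expected to recover the weights and the intercept almost exactly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)  # fixed seed, chosen here only for reproducibility

# 1,000 samples of 3-dimensional standard normal vectors
X_demo = rng.normal(loc=0.0, scale=1.0, size=(1000, 3))

# Noise-free linear target with known weights and intercept
true_coef = np.array([0.3, 0.5, -0.7])
true_intercept = 1.0
Y_demo = X_demo @ true_coef + true_intercept

model = LinearRegression().fit(X_demo, Y_demo)

print(model.coef_)       # expected to be very close to [0.3, 0.5, -0.7]
print(model.intercept_)  # expected to be very close to 1.0
```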

```We will now conduct a similar experiment, only this time using logistic regression, with minor adjustments:```

$X \sim \cal N(0,1)^3$ ```(i.e., X consists of 3-dimensional vectors)```

$Y = (0.3\cdot X[:,0] + 0.5\cdot X[:,1] - 0.7\cdot X[:,2] \geq 1)$

```Generate 1,000 samples.```

In [None]:
X = np.random.normal(0, 1, size=(1000, 3))  # standard normal, as specified above

In [None]:
Y = (0.3*X[:,0] + 0.5*X[:,1] - 0.7*X[:,2])

In [None]:
Y = (Y >= 1).astype(int)  # label is 1 where the linear score reaches the threshold of 1, else 0

```Train a simple logistic regression (sklearn.linear_model.LogisticRegression) on the data. What coefficients did you get using your logistic regression? Did you expect those coefficients? Why, or why not?```

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
clf = LogisticRegression()
clf.fit(X,Y)

In [None]:
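# Same level_0..level_3 column names as before, built directly from the arrays
df = pd.DataFrame(np.column_stack([X, Y]), columns=['level_0', 'level_1', 'level_2', 'level_3'])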

In [None]:
df.head()

In [None]:
clf.coef_

In [None]:
f, axes = plt.subplots(1, 3 , figsize=(20,5))
for i in range(3):
    sns.scatterplot(x=X[:,i], y=Y, ax=axes[i])  # keyword arguments are required by recent seaborn versions
plt.show()

In [None]:
f, axes = plt.subplots(1, 3 , figsize=(20,5))
sns.regplot(x='level_0', y='level_3', data=df, logistic=True , ax=axes[0])
sns.regplot(x='level_1', y='level_3', data=df, logistic=True , ax=axes[1])
sns.regplot(x='level_2', y='level_3', data=df, logistic=True , ax=axes[2])

```Repeat this experiment, this time with 10,000 samples and 100,000 samples. Did you get different coefficients? Why or why not?```

In [None]:
X = np.random.normal(0, 1, size=(10000, 3))

# Label is 1 where the linear score reaches the threshold of 1, else 0
Y = (0.3*X[:,0] + 0.5*X[:,1] - 0.7*X[:,2] >= 1).astype(int)

clf = LogisticRegression()
clf.fit(X,Y)

df = pd.DataFrame(np.column_stack([X, Y]), columns=['level_0', 'level_1', 'level_2', 'level_3'])

In [None]:
clf.coef_

In [None]:
f, axes = plt.subplots(1, 3 , figsize=(20,5))
sns.regplot(x='level_0', y='level_3', data=df, logistic=True , ax=axes[0])
sns.regplot(x='level_1', y='level_3', data=df, logistic=True , ax=axes[1])
sns.regplot(x='level_2', y='level_3', data=df, logistic=True , ax=axes[2])

In [None]:
X = np.random.normal(0, 1, size=(100000, 3))

# Label is 1 where the linear score reaches the threshold of 1, else 0
Y = (0.3*X[:,0] + 0.5*X[:,1] - 0.7*X[:,2] >= 1).astype(int)

clf = LogisticRegression()
clf.fit(X,Y)

df = pd.DataFrame(np.column_stack([X, Y]), columns=['level_0', 'level_1', 'level_2', 'level_3'])

In [None]:
clf.coef_

In [None]:
f, axes = plt.subplots(1, 3 , figsize=(20,5))
sns.regplot(x='level_0', y='level_3', data=df, logistic=True , ax=axes[0])
sns.regplot(x='level_1', y='level_3', data=df, logistic=True , ax=axes[1])
sns.regplot(x='level_2', y='level_3', data=df, logistic=True , ax=axes[2])

In [None]:
df['level_3'].value_counts() # there are many more zeros than ones, which affects both the shape of the plots and the coefficients
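
The sample-size question can also be examined numerically. The sketch below is an illustration under stated assumptions: the seed, `max_iter=1000` and the variable names are choices made here. Since the label is a deterministic threshold on a linear score, the two classes are linearly separable, and an unregularized logistic regression would have no finite optimum. scikit-learn's `LogisticRegression` applies an L2 penalty by default (`C=1.0`), which keeps the coefficients finite, so their magnitude depends on the number of samples and on `C`, while their direction is expected to track $(0.3, 0.5, -0.7)$.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)  # fixed seed, chosen here only for reproducibility
true_w = np.array([0.3, 0.5, -0.7])
true_direction = true_w / np.linalg.norm(true_w)

norms = []    # length of the fitted coefficient vector per sample size
cosines = []  # cosine similarity between fitted and true direction

for n_samples in (1_000, 10_000, 100_000):
    X_demo = rng.normal(size=(n_samples, 3))
    Y_demo = (X_demo @ true_w >= 1).astype(int)

    # Default L2-regularized fit; max_iter raised to reduce convergence warnings
    model = LogisticRegression(max_iter=1000).fit(X_demo, Y_demo)
    w = model.coef_[0]

    norms.append(np.linalg.norm(w))
    cosines.append(w @ true_direction / np.linalg.norm(w))
    print(n_samples, w, norms[-1], cosines[-1])
```

Comparing `w / np.linalg.norm(w)` across runs is usually more informative than comparing the raw coefficients, because only the direction of the decision boundary is pinned down by the data.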

## Non-linear linear regressions

```Load the data in func_1_train.csv.
Can be found in: https://drive.google.com/open?id=1y3HtVk0N1q4xYn_qczDcdkZfGpy23z9l```

In [None]:
df = pd.read_csv('func_1_train.csv')

```Draw a scatter plot of y as a function of x. What kind of functions would you like to fit here?```

In [None]:
df.head()

In [None]:
sns.scatterplot(x=df['x'], y=df['y'])
plt.show() # the relationship does not look linear

```Try fitting a linear regression. Draw the data points and the function you fitted on the same plot.```

In [None]:
clf = LinearRegression()
clf.fit(df[['x']], df[['y']])

In [None]:
Y_pred = clf.predict(df[['x']])

In [None]:
from sklearn.metrics import mean_squared_error
mean_squared_error(df['y'], Y_pred)

In [None]:
sns.scatterplot(x=df['x'], y=df['y'])
sns.scatterplot(x=df['x'], y=Y_pred[:,0])
plt.show()

```Let's do some feature extraction. Create a dataset with the features (X, X**2, X**3, ..., X**49). Now try fitting a linear regression using your new dataset. Draw your results on the same graph as before. Judge your results using func_1_test and using the mean squared error metric.```

```Can be found in: https://drive.google.com/open?id=1ipm09QTjVZWFgh-zc70rxWLGPWYEt1Kb```

In [None]:
df_test = pd.read_csv('func_1_test.csv')

### feature extraction for train

In [None]:
# Add the columns x2, x3, ..., x49 holding the corresponding powers of x
for power in range(2, 50):
    df['x'+str(power)] = np.power(df['x'], power)

In [None]:
df.head()

### feature extraction for test

In [None]:
# Add the columns x2, x3, ..., x49 holding the corresponding powers of x
for power in range(2, 50):
    df_test['x'+str(power)] = np.power(df_test['x'], power)
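
The same power features can also be produced with scikit-learn's `PolynomialFeatures`. The sketch below uses a toy stand-in for the `x` column and a small `max_power` so the output stays readable; both are assumptions for illustration, and the exercise itself uses powers up to 49.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

# Toy stand-in for the 'x' column; the real values come from func_1_train.csv
toy = pd.DataFrame({'x': [0.5, 1.0, 2.0]})
max_power = 4  # the exercise uses 49

# Manual construction, mirroring the x2, x3, ... column naming used above
manual = pd.DataFrame({'x': toy['x']})
for power in range(2, max_power + 1):
    manual[f'x{power}'] = toy['x'] ** power

# Equivalent construction with scikit-learn (the constant bias column is excluded)
poly = PolynomialFeatures(degree=max_power, include_bias=False)
auto = poly.fit_transform(toy[['x']])

print(manual)
print(np.allclose(manual.to_numpy(), auto))
```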

In [None]:
target = df['y']
X_train = df.drop('y' , axis=1)
clf = LinearRegression()
clf.fit(X_train , target)

In [None]:
Y_pred_train = clf.predict(X_train)
Y_pred_test = clf.predict(df_test.drop('y' , axis=1))

In [None]:
clf.coef_

In [None]:
sns.scatterplot(x=df_test['x'], y=Y_pred_test)
sns.scatterplot(x=df_test['x'], y=df_test['y'])
plt.show()

In [None]:
from sklearn.metrics import mean_squared_error
mean_squared_error(df_test['y'], Y_pred_test)

```How come you could fit a model that isn't underfitted, even though you still used a linear regression? Can you explain those results? However, your model seems to be overfitted. Why is that?```

In [None]:
# answer: it is still a linear regression, but it is linear in the features rather than in x. With the powers of x as features, the fitted function is a polynomial in x, so the model is much more flexible and is no longer underfitted.
# the model is overfitted because it has many features (49) and only a small dataset to fit them on, so it can also fit the noise in the train set.
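
The point about flexibility can be illustrated on synthetic data. Everything in the sketch below is an assumption made for illustration: the target `sin(3x)`, the noise level, the sample sizes and the degree 15 are not taken from `func_1`. The higher-degree fit is expected to reach a lower training error than the straight line; how its test error compares depends on the draw.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)  # fixed seed, chosen here only for reproducibility

def make_data(n_samples):
    # Hypothetical curved target with noise; NOT the function behind func_1
    x = rng.uniform(-1, 1, size=n_samples)
    y = np.sin(3 * x) + rng.normal(scale=0.1, size=n_samples)
    return x, y

def powers(x, degree):
    # Feature matrix with columns x, x**2, ..., x**degree
    return np.column_stack([x ** p for p in range(1, degree + 1)])

x_tr, y_tr = make_data(30)   # small train set
x_te, y_te = make_data(200)  # larger test set

results = {}  # degree -> (train MSE, test MSE)
for degree in (1, 15):
    model = LinearRegression().fit(powers(x_tr, degree), y_tr)
    results[degree] = (
        mean_squared_error(y_tr, model.predict(powers(x_tr, degree))),
        mean_squared_error(y_te, model.predict(powers(x_te, degree))),
    )
    print(degree, results[degree])
```

A large gap between the train and test error of the high-degree model is the usual sign of overfitting.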

```You can try using a regularized regression to avoid overfitting. Use sklearn.linear_model.Ridge with different alphas until you get a nice fit (judge it using your plots and using func_1_test and the mean squared error metric). Could you get better results?```

In [None]:
from sklearn.linear_model import Ridge

In [None]:
clf = Ridge(alpha=1.5)
clf.fit(X_train , target)

In [None]:
Y_pred_test = clf.predict(df_test.drop('y' , axis=1))

sns.scatterplot(x=df_test['x'], y=Y_pred_test)
sns.scatterplot(x=df_test['x'], y=df_test['y'])
plt.show()

In [None]:
from sklearn.metrics import mean_squared_error
mean_squared_error(df_test['y'], Y_pred_test)
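
Instead of trying alphas one at a time, they can be swept in a loop. The helper `sweep_ridge_alphas` below is hypothetical (it is not part of scikit-learn), and the synthetic data only stands in for the `func_1` files. It is a minimal sketch of the search, judged by test MSE as the exercise suggests.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

def sweep_ridge_alphas(X_train, y_train, X_test, y_test, alphas):
    """Return {alpha: test MSE} for a Ridge model fitted at each alpha."""
    scores = {}
    for alpha in alphas:
        model = Ridge(alpha=alpha).fit(X_train, y_train)
        scores[alpha] = mean_squared_error(y_test, model.predict(X_test))
    return scores

# Synthetic stand-in data (an assumption, not the func_1 dataset)
rng = np.random.default_rng(0)

def powers(x, degree=10):
    # Feature matrix with columns x, x**2, ..., x**degree
    return np.column_stack([x ** p for p in range(1, degree + 1)])

x_tr = rng.uniform(-1, 1, size=40)
x_te = rng.uniform(-1, 1, size=200)
y_tr = np.sin(3 * x_tr) + rng.normal(scale=0.1, size=40)
y_te = np.sin(3 * x_te) + rng.normal(scale=0.1, size=200)

alphas = [0.001, 0.01, 0.1, 1.0, 10.0]
scores = sweep_ridge_alphas(powers(x_tr), y_tr, powers(x_te), y_te, alphas)
best_alpha = min(scores, key=scores.get)  # alpha with the lowest test MSE

print(scores)
print(best_alpha)
```

Choosing alpha on the test set follows the exercise; with real data a separate validation split or cross-validation would give a less optimistic estimate.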

## Working on real-life data

```Start by loading house_data.csv. For our current purposes we will need only the numeric columns. Take only the numeric columns, using pandas.DataFrame.dtypes. Use fillna(0) to get rid of NaNs, and drop the column Id (can you tell why?).```

```Can be found in: https://drive.google.com/open?id=1ID2h8mzjXLRbay5pE0QN1v5-jc86203L```

In [None]:
df = pd.read_csv('house_data.csv')
df_original = df.copy()

In [None]:
df = df.loc[:, df.dtypes != 'object']
df = df.drop('Id', axis=1)  # reassign instead of modifying a slice in place

In [None]:
df = df.fillna(0)

```We would like to predict the SalePrice column. Create, using your data, the features dataset and the target dataset. Use sklearn.model_selection.train_test_split and create a train segment of 0.7 of your data and test segment of 0.3 of your data.```

In [None]:
df_x = df.drop('SalePrice', axis=1)
df_y = df['SalePrice']

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df_x, df_y, test_size=0.3, random_state=42)

```Try fitting the best linear regression you can (evaluate yourself using mean squared error). You can also use ridge regression with different alphas.```

In [None]:
clf = Ridge(alpha=3000)
#clf = LinearRegression()
clf.fit(X_train , y_train)

In [None]:
y_test_pred = clf.predict(X_test)

In [None]:
mean_squared_error(y_test, y_test_pred)

```If you look closely, you will find that you dropped the columns LotShape and LandContour. This time, do not drop them; instead, replace them with their one-hot encoding (consider``` [pandas.get_dummies](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html))```.
Try getting better results on the test segment using your ridge regressions. Compare the new results with the old ones.```

In [None]:
df_original = df_original.fillna(0)

In [None]:
df_with_dum = pd.get_dummies(df_original, columns=['LotShape','LandContour'] , dtype='int64')
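
To see what `pd.get_dummies` does with the `columns` argument, here is a small sketch on a toy frame. The rows and prices are made up for illustration and are not taken from house_data.csv.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    'LotShape': ['Reg', 'IR1', 'Reg'],
    'LandContour': ['Lvl', 'Lvl', 'Bnk'],
    'SalePrice': [200000, 150000, 180000],
})

# Each listed column is replaced by one 0/1 indicator column per category
encoded = pd.get_dummies(toy, columns=['LotShape', 'LandContour'], dtype='int64')

print(encoded.columns.tolist())
print(encoded)
```

The untouched columns come first, followed by the indicator columns named `<column>_<category>`.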

In [None]:
df_with_dum.drop('Id' , axis=1 , inplace=True)
df_with_dum = df_with_dum.loc[:, (df_with_dum.dtypes != 'object')  ]

In [None]:
df_x = df_with_dum.drop('SalePrice', axis=1)
df_y = df_with_dum['SalePrice']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df_x, df_y, test_size=0.3, random_state=42)

clf = Ridge(alpha=20)
#clf = LinearRegression()
clf.fit(X_train , y_train)

In [None]:
y_test_pred = clf.predict(X_test)

mean_squared_error(y_test, y_test_pred)

```Now add more features to your dataframe:```
- ```LotArea in square meters (it's currently in units of square feet)```
- ```1stFlrSF + 2ndFlrSF```
- ```GarageArea**0.5```
- ```LotArea / (BedroomAbvGr+1)```
- ```LotArea / (mean LotArea for houses built in that same year + 1e-5) - you might want to use``` [pandas merge function](https://www.google.com/search?q=pandas+merge&oq=pandas+merge&aqs=chrome..69i57l2j69i59l3j69i60.2080j0j9&sourceid=chrome&ie=UTF-8)
- ```Ranking of LotArea (largest house has 1, the second largest has 2 and so on)```
- ```One hot encoding of LotConfig```

```Do they improve the results?```

In [None]:
df_with_extend = df_original.fillna(0)

In [None]:
df_with_extend = pd.get_dummies(df_with_extend, columns=['LotShape','LandContour','LotConfig'] , dtype='int64')

In [None]:
df_with_extend.drop('Id' , axis=1 , inplace=True)
df_with_extend = df_with_extend.loc[:, (df_with_extend.dtypes != 'object')  ]

In [None]:
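df_with_extend['LotArea_meter'] = df_with_extend['LotArea'] * 0.09290304 # square feet to square meters (1 ft = 0.3048 m)
df_with_extend['TotalFlrSF'] = df_with_extend['1stFlrSF'] + df_with_extend['2ndFlrSF'] # 1stFlrSF + 2ndFlrSF; the new column name is chosen here
df_with_extend['GarageArea_sqrt'] = df_with_extend['GarageArea']**0.5 # stored as a new column so the original GarageArea is kept
df_with_extend['LB'] = df_with_extend['LotArea'] / (df_with_extend['BedroomAbvGr']+1)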

In [None]:
year_area = pd.DataFrame(df_with_extend.groupby(['YearBuilt'])['LotArea'].mean()).reset_index()
year_area.rename(columns={'LotArea' : 'LotAreaPerYear'} , inplace=True)
df_with_extend = df_with_extend.merge(year_area , on='YearBuilt')
df_with_extend['LotArea_mean_year'] = df_with_extend['LotArea'] / (df_with_extend['LotAreaPerYear'] + 1e-5)

In [None]:
df_with_extend['LotArea_rank'] = df_with_extend['LotArea'].rank(ascending=False, method='min') # largest LotArea gets rank 1
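
The per-year mean and the ranking can be checked on a toy frame. The sketch below uses `groupby(...).transform('mean')` as an alternative to the merge-based approach; the rows are made up for illustration, and the column name `LotArea_rel_year` is an assumption.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    'YearBuilt': [2000, 2000, 2005, 2005],
    'LotArea': [8000.0, 12000.0, 9000.0, 9000.0],
})

# Mean LotArea of houses built in the same year, aligned with the original rows
year_mean = toy.groupby('YearBuilt')['LotArea'].transform('mean')
toy['LotArea_rel_year'] = toy['LotArea'] / (year_mean + 1e-5)

# Descending rank: the largest LotArea gets 1, ties share the lowest rank
toy['LotArea_rank'] = toy['LotArea'].rank(ascending=False, method='min')

print(toy)
```

`transform` keeps the original row order and index, so no merge or index reset is needed.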

In [None]:
df_x = df_with_extend.drop('SalePrice', axis=1)
df_y = df_with_extend['SalePrice']


X_train, X_test, y_train, y_test = train_test_split(df_x, df_y, test_size=0.3, random_state=42)

clf = Ridge(alpha=50)
#clf = LinearRegression()
clf.fit(X_train , y_train)

y_test_pred = clf.predict(X_test)

mean_squared_error(y_test, y_test_pred) # didn't improve the result :(

## Bonus

```Think of a feature of your own that improves the result.```

In [None]:
df_with_extend['LotAreaMeter_divideByYear'] = df_with_extend['LotArea_meter']/df_with_extend['YearBuilt']
df_with_extend['untillYearRemodAdd'] = df_with_extend['YearBuilt'] - df_with_extend['YearRemodAdd']

```Use KNN regression (sklearn.neighbors.KNeighborsRegressor).
Compare the results to the linear regression.```

In [None]:
from sklearn.neighbors import KNeighborsRegressor
clf = KNeighborsRegressor(n_neighbors=4)

X_train, X_test, y_train, y_test = train_test_split(df_with_extend.drop('SalePrice', axis=1), df_y, test_size=0.3, random_state=42)
clf.fit(X_train , y_train)
y_test_pred = clf.predict(X_test)

mean_squared_error(y_test, y_test_pred) # didn't improve the result :(
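
One possible reason for a weak KNN result is feature scale: KNN is distance-based, so columns with large numeric ranges dominate the neighbor search. The sketch below illustrates this on synthetic data, which is an assumption made for illustration and not the house dataset. It compares a plain `KNeighborsRegressor` with the same model behind a `StandardScaler`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)  # fixed seed, chosen here only for reproducibility
n = 500

# One informative feature on a small scale, one irrelevant feature on a large scale
informative = rng.normal(size=n)
large_scale_noise = rng.normal(scale=1000.0, size=n)
X_demo = np.column_stack([informative, large_scale_noise])
y_demo = 3.0 * informative

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# Same number of neighbors as above, with and without standardization
plain = KNeighborsRegressor(n_neighbors=4).fit(X_tr, y_tr)
scaled = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=4)).fit(X_tr, y_tr)

mse_plain = mean_squared_error(y_te, plain.predict(X_te))
mse_scaled = mean_squared_error(y_te, scaled.predict(X_te))
print(mse_plain, mse_scaled)
```

Whether scaling helps on the house data is not shown here; it is a cheap thing to try before concluding that KNN performs worse than the ridge regression.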