## Homework


Use logistic regression to predict `above_average`, which is 1 if `median_house_value` is above its mean and 0 otherwise.

### Dataset

In this homework, we will use the California Housing Prices data from [Kaggle](https://www.kaggle.com/datasets/camnugent/california-housing-prices).

https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv



### Data preparation
* Download the data, read it with pandas
* Look at the data
* Evaluate and handle null values
* Do feature engineering
* Apply one-hot encoding to the categorical variables


In [None]:
import pandas as pd 
url = 'https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv'
df = pd.read_csv(url)

In [None]:
df.head()

Select numerical and categorical features

In [None]:
numerical = [
    'latitude',
    'longitude',
    'housing_median_age',
    'total_rooms',
    'total_bedrooms',
    'population',
    'households',
    'median_income',
    'median_house_value',
]
# ocean_proximity holds strings, so it belongs with the categorical features only.
categorical = ['ocean_proximity']

Check null values

In [None]:
df.isnull().sum()

Fill N/A with 0

In [None]:
df['total_bedrooms'] = df['total_bedrooms'].fillna(0)

Feature Engineering

In [None]:
df["rooms_per_household"] = df['total_rooms'] / df['households']
df["bedrooms_per_room"] = df['total_bedrooms'] / df['total_rooms']
df["population_per_household"] = df['population'] / df['households']

### One-hot encoding

* Encode categorical features


In [None]:
ocean_proximity_onehot = pd.get_dummies(df.ocean_proximity, prefix='ocean_proximity=')
df = pd.concat([df,ocean_proximity_onehot], axis=1)
df.head()
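
As a side note on column naming: `pd.get_dummies` joins the prefix and the category with `prefix_sep`, which defaults to `_`. The sketch below uses a made-up three-row series (not the housing data) to show how the prefix used above yields names such as `ocean_proximity=_INLAND`, and how passing `prefix_sep` changes them. Switching the separator in this notebook would rename the columns in the later outputs, so treat it as an illustration only.

```python
import pandas as pd

# Toy series standing in for df.ocean_proximity (values are illustrative).
toy = pd.Series(["INLAND", "NEAR BAY", "INLAND"], name="ocean_proximity")

# Prefix ending in "=" plus the default separator "_" gives "=_" in the names.
default_sep = pd.get_dummies(toy, prefix="ocean_proximity=")

# Passing the separator explicitly avoids the extra underscore.
custom_sep = pd.get_dummies(toy, prefix="ocean_proximity", prefix_sep="=")

print(list(default_sep.columns))  # ['ocean_proximity=_INLAND', 'ocean_proximity=_NEAR BAY']
print(list(custom_sep.columns))   # ['ocean_proximity=INLAND', 'ocean_proximity=NEAR BAY']
```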


### Question 1

What is the most frequent observation (mode) for the column `ocean_proximity`?

In [None]:
df["ocean_proximity"].value_counts()

<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: ocean_proximity, dtype: int64
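
The counts above show `<1H OCEAN` as the most frequent value. If only the mode is needed, it can be read directly instead of scanning the table. A minimal sketch on a made-up series (the values are illustrative, not the real counts):

```python
import pandas as pd

# Toy stand-in for df["ocean_proximity"].
toy = pd.Series(["INLAND", "<1H OCEAN", "<1H OCEAN", "NEAR BAY"], name="ocean_proximity")

# Series.mode() returns all modes as a Series; take the first one.
most_frequent = toy.mode()[0]

# Equivalent route through value_counts(): the label with the highest count.
top_from_counts = toy.value_counts().idxmax()

print(most_frequent, top_from_counts)  # <1H OCEAN <1H OCEAN
```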

### Question 2

* Create the [correlation matrix](https://www.google.com/search?q=correlation+matrix) for the numerical features of your train dataset.
    - In a correlation matrix, you compute the correlation coefficient between every pair of features in the dataset.
* What are the two features that have the biggest correlation in this dataset?

In [None]:
pd.options.display.float_format = "{:,.2f}".format

df[numerical].corr()

| | latitude | longitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| latitude | 1.0 | -0.92 | 0.01 | -0.04 | -0.07 | -0.11 | -0.07 | -0.08 | -0.14 |
| longitude | -0.92 | 1.0 | -0.11 | 0.04 | 0.07 | 0.1 | 0.06 | -0.02 | -0.05 |
| housing_median_age | 0.01 | -0.11 | 1.0 | -0.36 | -0.32 | -0.3 | -0.3 | -0.12 | 0.11 |
| total_rooms | -0.04 | 0.04 | -0.36 | 1.0 | 0.92 | 0.86 | 0.92 | 0.2 | 0.13 |
| total_bedrooms | -0.07 | 0.07 | -0.32 | 0.92 | 1.0 | 0.87 | 0.97 | -0.01 | 0.05 |
| population | -0.11 | 0.1 | -0.3 | 0.86 | 0.87 | 1.0 | 0.91 | 0.0 | -0.02 |
| households | -0.07 | 0.06 | -0.3 | 0.92 | 0.97 | 0.91 | 1.0 | 0.01 | 0.07 |
| median_income | -0.08 | -0.02 | -0.12 | 0.2 | -0.01 | 0.0 | 0.01 | 1.0 | 0.69 |
| median_house_value | -0.14 | -0.05 | 0.11 | 0.13 | 0.05 | -0.02 | 0.07 | 0.69 | 1.0 |


`total_bedrooms` and `households` have the biggest correlation (0.97).
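
Instead of scanning the matrix by eye, the strongest pair can be extracted programmatically. The sketch below is one possible approach, shown on a made-up three-column frame; the helper name `top_correlated_pair` is hypothetical, and ranking by absolute value is an assumption about what "biggest" should mean.

```python
import numpy as np
import pandas as pd

def top_correlated_pair(frame: pd.DataFrame) -> tuple:
    """Return (feature_a, feature_b, correlation) for the pair with the largest |correlation|."""
    corr = frame.corr()
    # Keep only the strict upper triangle: drops the diagonal and mirrored duplicates.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    best = pairs.abs().idxmax()
    return best[0], best[1], pairs.loc[best]

# Toy data: "b" is nearly proportional to "a", "c" is loosely decreasing.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.1, 5.9, 8.2, 9.9],
    "c": [5.0, 3.0, 4.0, 1.0, 2.0],
})

pair = top_correlated_pair(toy)
print(pair[0], pair[1], round(pair[2], 2))
```

On the notebook's data the call would be `top_correlated_pair(df[numerical])`, assuming `numerical` lists only numeric columns.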

### Make `median_house_value` binary

* We need to turn the `median_house_value` variable from numeric into binary.
* Let's create a variable `above_average` which is `1` if the `median_house_value` is above its mean value and `0` otherwise.



In [None]:
median_house_value_mean = df["median_house_value"].mean()
df["median_house_value"] = df["median_house_value"] > median_house_value_mean

In [None]:
df["median_house_value"].value_counts()

False    12255
True      8385
Name: median_house_value, dtype: int64
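
The cell above stores booleans in `median_house_value`, which scikit-learn accepts as a binary target. For reference, here is a minimal sketch of the explicit 0/1 `above_average` column described in the task, using made-up prices rather than the real data:

```python
import pandas as pd

# Toy prices; their mean is exactly 200000.
toy = pd.DataFrame({"median_house_value": [100_000.0, 150_000.0, 200_000.0, 350_000.0]})

mean_value = toy["median_house_value"].mean()

# Strictly above the mean -> 1, otherwise 0 (a value equal to the mean maps to 0).
toy["above_average"] = (toy["median_house_value"] > mean_value).astype(int)

print(toy["above_average"].tolist())  # [0, 0, 0, 1]
```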

### Split the data

* Split your data in train/val/test sets, with 60%/20%/20% distribution.
* Use Scikit-Learn for that (the `train_test_split` function) and set the seed to 42.
* Make sure that the target value (`median_house_value`) is not in your dataframe.

In [None]:
from sklearn.model_selection import train_test_split

df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=42)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=42)

In [None]:
df_train.shape, df_test.shape, df_val.shape

((12384, 18), (4128, 18), (4128, 18))
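
The second split uses `test_size=0.25` because it is applied to the remaining 80% of the rows: 0.25 × 0.8 = 0.2 of the whole dataset. A quick check on a dummy array of 1000 row indices (not the housing data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(1000)  # stand-in for the dataframe rows

# First split: 80% full train, 20% test.
full_train, test = train_test_split(rows, test_size=0.2, random_state=42)

# Second split: 25% of the 80% becomes validation, i.e. 20% of all rows.
train, val = train_test_split(full_train, test_size=0.25, random_state=42)

print(len(train), len(val), len(test))  # 600 200 200
```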

In [None]:
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

In [None]:
y_train = df_train.median_house_value.values
y_val = df_val.median_house_value.values
y_test = df_test.median_house_value.values

del df_train['median_house_value']
del df_val['median_house_value']
del df_test['median_house_value']

### Question 3

* Calculate the mutual information score with the (binarized) price for the categorical variable that we have. Use the training set only.
* What is the value of mutual information?
* Round it to 2 decimal digits using `round(score, 2)`

In [None]:
from sklearn.metrics import mutual_info_score

def mutual_info_price_score(series):
    # Mutual information between one categorical column and the binarized price.
    return mutual_info_score(series, y_train)

mi = df_train[categorical].apply(mutual_info_price_score)
mi.sort_values(ascending=False)

ocean_proximity   0.10
dtype: float64
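
To put the 0.10 into perspective, it helps to see the two extremes of `mutual_info_score`, which reports values in nats. In the toy sketch below (made-up labels), a feature that fully determines a balanced binary target scores ln 2 ≈ 0.69, while an independent feature scores 0.

```python
from math import log

from sklearn.metrics import mutual_info_score

target = [0, 0, 1, 1]

# Fully determines the target: each category maps to exactly one class.
informative = ["INLAND", "INLAND", "NEAR BAY", "NEAR BAY"]

# Independent of the target: each category is split evenly across classes.
uninformative = ["INLAND", "NEAR BAY", "INLAND", "NEAR BAY"]

mi_informative = mutual_info_score(informative, target)
mi_uninformative = mutual_info_score(uninformative, target)

print(round(mi_informative, 2), round(mi_uninformative, 2))  # 0.69 0.0
print(round(log(2), 2))                                      # 0.69
```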

### Question 4

* Now let's train a logistic regression
* Remember that we have one categorical variable `ocean_proximity` in the data. Include it using one-hot encoding.
* Fit the model on the training dataset.
    - To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
    - `model = LogisticRegression(solver="liblinear", C=1.0, max_iter=1000, random_state=42)`
* Calculate the accuracy on the validation dataset and round it to 2 decimal digits.

In [None]:

features = df_train.columns[df_train.dtypes != "object"]
X_train = df_train[features]

In [None]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(solver="liblinear", C=1.0, max_iter=1000, random_state=42)
model.fit(X_train, y_train)


LogisticRegression(max_iter=1000, random_state=42, solver='liblinear')

In [None]:
X_val = df_val[features]
model.score(X_val, y_val)

0.8352713178294574
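
For a classifier, `model.score` returns the accuracy, that is, the share of predictions that match the true labels. The sketch below checks this on synthetic data (random features and a made-up target, not the housing set), using the same model parameters as above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: the target is True when the two features sum to a positive number.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = X[:, 0] + X[:, 1] > 0

X_fit, y_fit = X[:150], y[:150]
X_check, y_check = X[150:], y[150:]

model = LogisticRegression(solver="liblinear", C=1.0, max_iter=1000, random_state=42)
model.fit(X_fit, y_fit)

# Accuracy by hand: fraction of predictions equal to the true labels.
manual_accuracy = (model.predict(X_check) == y_check).mean()
built_in_accuracy = model.score(X_check, y_check)

print(round(manual_accuracy, 2), round(built_in_accuracy, 2))
```

Rounding for the answer is then `round(model.score(X_val, y_val), 2)`.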

### Question 5 

* Let's find the least useful feature using the *feature elimination* technique.
* Train a model with all these features (using the same parameters as in Q4).
* Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
* For each feature, calculate the difference between the original accuracy and the accuracy without the feature. 
* Which of the following features has the smallest difference?
   * `total_rooms`
   * `total_bedrooms` 
   * `population`
   * `households`

In [None]:
myfeatures = features.to_list()
original_score = 0.835  # validation accuracy from Question 4, rounded to 3 digits

for feature in myfeatures:
    # Train on every feature except the current one.
    partial_features = [f for f in myfeatures if f != feature]
    model = LogisticRegression(solver="liblinear", C=1.0, max_iter=1000, random_state=42)
    model.fit(df_train[partial_features], y_train)
    score = model.score(df_val[partial_features], y_val)
    # Print accuracy (%), absolute difference from the original accuracy (%), and the dropped feature.
    print(round(score * 100, 2), round(abs(original_score - score) * 100, 3), feature)

83.36 0.142 latitude
83.24 0.264 longitude
83.09 0.409 housing_median_age
83.75 0.245 total_rooms
83.5 0.003 total_bedrooms
82.63 0.869 population
83.31 0.191 households
78.66 4.842 median_income
83.58 0.076 ocean_proximity=_<1H OCEAN
83.62 0.124 ocean_proximity=_INLAND
83.62 0.124 ocean_proximity=_ISLAND
83.6 0.1 ocean_proximity=_NEAR BAY
83.43 0.07 ocean_proximity=_NEAR OCEAN
83.5 0.003 rooms_per_household
83.5 0.003 bedrooms_per_room
83.67 0.172 population_per_household


`total_bedrooms` is the least useful feature: of the four options, removing it changes the accuracy the least (0.003 percentage points).
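
The elimination loop can also be wrapped in a helper that computes the baseline accuracy itself rather than relying on a rounded constant. This is a sketch under assumptions: the names `make_model` and `elimination_differences` are hypothetical, and the data is synthetic (one informative column, one pure-noise column), so the numbers say nothing about the housing features.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def make_model():
    # Same parameters as the Question 4 model.
    return LogisticRegression(solver="liblinear", C=1.0, max_iter=1000, random_state=42)

def elimination_differences(X_train, y_train, X_val, y_val):
    """Return the baseline accuracy and, per feature, baseline minus accuracy without it."""
    baseline = make_model().fit(X_train, y_train).score(X_val, y_val)
    differences = {}
    for feature in X_train.columns:
        kept = [c for c in X_train.columns if c != feature]
        accuracy = make_model().fit(X_train[kept], y_train).score(X_val[kept], y_val)
        differences[feature] = baseline - accuracy
    return baseline, differences

# Synthetic data: only "signal" carries information about the target.
rng = np.random.default_rng(0)
data = pd.DataFrame({"signal": rng.normal(size=400), "noise": rng.normal(size=400)})
target = (data["signal"] > 0).values

baseline, diffs = elimination_differences(data[:300], target[:300], data[300:], target[300:])

# The least useful feature is the one whose removal changes accuracy the least.
least_useful = min(diffs, key=lambda name: abs(diffs[name]))
print(least_useful)
```

On the notebook's data the call would be `elimination_differences(df_train[features], y_train, df_val[features], y_val)`.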