# California House Price Model

## Project Description
Your task is to predict the average house values in Californian districts, given a number of features for each district:
- Location (Longitude and Latitude)
- Average Houses' Age
- Total Rooms
- Total Bedrooms
- District Population
- Number of Households
- Average Annual Income
- Average House Value (the target to predict)
- Proximity to Ocean Categories: One Hour Away from Ocean (1H Ocean), Inland, Near Ocean, Near Bay, Island

## 1. Import the Basic Libraries

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## 2. Importing the Dataset

In [None]:
#data = pd.read_csv('CaliforniaHouseData.csv')
data = pd.read_csv()
data.head()

### 2.1 Display a Random Sample of Data

In [None]:
import random
my_random_subset = random.sample(range(len(data)), 10)
data.iloc[my_random_subset]

## 3. Exploratory Data Analysis (EDA)
In this step, we will use data visualization methods to explore the main characteristics of the dataset.

### 3.1. Reviewing the data for some general information

In [None]:
data.info()

In [None]:
data['Ocean_proximity'].value_counts()

### 3.2. Reviewing overall statistical information

In [None]:
data.describe()

### 3.3. Getting some insights by plotting different variables

In [None]:
data.hist()

### 3.4. Visualizing the data based on the location

In [None]:
data.plot(kind = 'scatter', x = 'Longitude', y = 'Latitude')

If you look closely, the shape of the plot above resembles California. We can also set the alpha (transparency) parameter so that denser areas appear in a more intense color. Let's try alpha = 0.05, 0.1, and 0.3.

In [None]:
for a in [0.05, 0.1, 0.3]:
  # one scatter plot per transparency level
  data.plot(kind = 'scatter', x = 'Longitude', y = 'Latitude', alpha = a)

plt.show()

Now you can clearly see some high-density areas, e.g., the Bay Area, Los Angeles, and San Diego. Next, we can visualize the data to check whether housing price is related to location and population density.

### 3.5. Visualizing the data based on the house value and population
We can redraw the plot above, this time showing each district's population as the radius of its circle (option s) and its house value as the color (option c). A larger circle means a higher population, and the color scale indicates the house value.

In [None]:
data.plot(kind = 'scatter', x = 'Longitude', y = 'Latitude', label = 'Population',
          s = data['Population'] / 100, c = 'Value', cmap = plt.get_cmap('jet'),  # assumes a 'Population' column; /100 is an arbitrary marker scale
          colorbar = True, alpha = 0.1, figsize = [15, 10])
plt.legend()

We can see that housing price appears to be strongly related to location and population density.

### 3.6. Checking correlations with the house value
- Here, we want to see how the house value correlates with the other parameters. We can compute the standard correlation coefficient (Pearson's r) between the house value and each of the other parameters.
- The correlation coefficient ranges from -1 to +1. A value close to +1 indicates a strong positive correlation (for example, the house value goes up as the parameter increases); a value close to -1 indicates a strong negative correlation (for example, the house value goes down as the parameter increases).
- Please note that a correlation coefficient of exactly 1 appears wherever a parameter is compared with itself.
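
To make the bullets above concrete, here is a minimal sketch that computes Pearson's r by hand on synthetic data. The arrays and the helper name `pearson_r` are hypothetical and only serve as an illustration; they are not part of the housing dataset or of the notebook's own code.

```python
import numpy as np

# Synthetic data (hypothetical values, not taken from the housing dataset)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_up = 2.0 * x + 1.0       # rises perfectly with x
y_down = -3.0 * x + 10.0   # falls perfectly with x

def pearson_r(a, b):
    """Covariance of a and b divided by the product of their standard deviations."""
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    numerator = (a_centered * b_centered).sum()
    denominator = np.sqrt((a_centered ** 2).sum() * (b_centered ** 2).sum())
    return numerator / denominator

r_up = pearson_r(x, y_up)      # close to +1: strong positive correlation
r_down = pearson_r(x, y_down)  # close to -1: strong negative correlation
r_self = pearson_r(x, x)       # a variable compared with itself gives 1
print(r_up, r_down, r_self)
```

The three values should come out at approximately +1, -1, and 1, matching the three cases described in the list.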

In [None]:
r_matrix = data.corr(numeric_only=True) #skip the non-numeric Ocean_proximity column
r_matrix.head()

**Create a Heatmap for the correlation Matrix**

In [None]:
r_mask = np.triu(np.ones_like(r_matrix, dtype = bool))
from seaborn import heatmap
plt.figure(figsize = [6,5], dpi = 100)
plt.title('Correlation Heatmap')
heatmap(r_matrix, mask=r_mask, annot=True, lw=1, linecolor='White', cmap='Blues', fmt = "0.2f")

In [None]:
r_matrix['Value'].sort_values()

In [None]:
data.plot(kind = 'scatter', x = '', y = '', alpha = 0.2, figsize = [7, 5]) #plot the most correlated ones

In [None]:
data.plot(kind = 'scatter', x = '', y = '', alpha = 0.2, figsize = [7, 5]) #plot the most inversely correlated ones

### 3.7. Creating new features and checking their correlations with the house value

In [None]:
data["Rooms_per_household"] = data["Total_rooms"] / data["Households"]
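# Assumes the columns are named Total_bedrooms and Population, following the Total_rooms pattern
data["Bedrooms_per_room"] = data["Total_bedrooms"] / data["Total_rooms"]
data["Population_per_household"] = data["Population"] / data["Households"]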
r_matrix = data.corr(numeric_only=True) #skip the non-numeric Ocean_proximity column
r_matrix

In [None]:
r_mask = np.triu(np.ones_like(r_matrix, dtype = bool))
from seaborn import heatmap
plt.figure(figsize = [7,6], dpi = 100)
plt.title('Correlation Heatmap')
heatmap(r_matrix, mask=r_mask, annot=True, lw=1, linecolor='White', cmap='magma', fmt = "0.2f")

In [None]:
r_matrix['Value'].sort_values()

Rooms_per_household and Bedrooms_per_room correlate more strongly with the house value than Population_per_household does.

## 4. Preprocessing the Data for Machine Learning

In [None]:
data.info()

### 4.1. Rearranging the sequence of the data
- Put all numerical data in the first 10 columns.
- Put house value at the 2nd last column.
- Put the categorical data (ocean proximity) as the last column.

In [None]:
my_string = data.columns
my_string

In [None]:
new_columns = []
data = data.reindex(columns=new_columns)

In [None]:
data.info()

We have some missing data.

### 4.2. Separating the input data from the output data

In [None]:
X = data.iloc[:, :-2].values #all numerical features; skipping Ocean_proximity for now
y = data.iloc[:, -2].values #house value is the 2nd last column

### 4.3. Imputing the missing numerical data

In [None]:
from sklearn.impute import SimpleImputer
my_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X = my_imputer.fit_transform(X)
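
As a small illustration of what mean imputation does, the sketch below applies the same `SimpleImputer` settings to a hypothetical 3x2 array (the values are made up for this example). Each missing entry is replaced by the mean of the observed values in its own column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy array with one missing value per column
toy = np.array([[1.0, 10.0],
                [np.nan, 20.0],
                [3.0, np.nan]])

toy_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
toy_filled = toy_imputer.fit_transform(toy)

# Column means are computed from the observed values only:
# column 0 -> (1 + 3) / 2 = 2, column 1 -> (10 + 20) / 2 = 15
print(toy_imputer.statistics_)
print(toy_filled)
```

The missing entry in the first column becomes 2.0 and the one in the second column becomes 15.0, while the observed values are left unchanged.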

### 4.4. Taking care of the outliers in the numerical data

In [None]:
from sklearn.neighbors import LocalOutlierFactor
my_lof = LocalOutlierFactor(contamination=0.01)
y_hat = my_lof.fit_predict(X) #returns +1 for inliers and -1 for outliers
outlier_mask = (y_hat != -1) #True for inliers, i.e., the rows we keep
print('Before Outlier removal:\nX.shape = ', X.shape, ' and y.shape = ', y.shape)
X, y = X[outlier_mask], y[outlier_mask]
print('After Outlier removal: \nX.shape = ', X.shape, ' and y.shape = ', y.shape)
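
The masking step can be checked on a tiny example. The sketch below is an illustration under stated assumptions: the toy data are synthetic, and `n_neighbors=5` and `contamination=0.05` are chosen for this small sample only (the notebook cell keeps the default number of neighbors and uses `contamination=0.01`).

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Hypothetical toy data: 20 points clustered near the origin plus one far-away point
rng = np.random.default_rng(0)
cluster = rng.normal(loc=0.0, scale=1.0, size=(20, 2))
toy_X = np.vstack([cluster, [[10.0, 10.0]]])
toy_y = np.arange(len(toy_X))  # stand-in targets, one per row

# n_neighbors=5 suits this small sample; it is an assumption of the sketch
toy_lof = LocalOutlierFactor(n_neighbors=5, contamination=0.05)
toy_labels = toy_lof.fit_predict(toy_X)  # +1 for inliers, -1 for outliers

keep = (toy_labels != -1)  # boolean mask that is True for inliers
toy_X_clean, toy_y_clean = toy_X[keep], toy_y[keep]
print(toy_labels[-1], toy_X.shape, toy_X_clean.shape)
```

The far-away point is labeled -1 and dropped, and applying the same mask to both arrays keeps the features and the targets aligned row by row.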

### 4.5. Scaling the numerical data 

In [None]:
from sklearn.preprocessing import StandardScaler
my_sc = StandardScaler()
X_std = my_sc.fit_transform(X)
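
For intuition, here is a minimal sketch of what standardization does, using a hypothetical array whose two columns live on very different scales. After scaling, each column should have a mean of about 0 and a standard deviation of about 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features on very different scales
toy = np.array([[1.0, 100.0],
                [2.0, 200.0],
                [3.0, 300.0]])

toy_scaler = StandardScaler()
toy_std = toy_scaler.fit_transform(toy)  # (value - column mean) / column std

col_means = toy_std.mean(axis=0)
col_stds = toy_std.std(axis=0)
print(toy_scaler.mean_)   # per-column means learned during fit
print(col_means, col_stds)
```

Both columns end up with identical standardized values here, because they differ only by a constant factor before scaling.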

### 4.6. Encoding the categorical data

In [None]:
from sklearn.preprocessing import OneHotEncoder
my_enc = OneHotEncoder()
encoded_data = my_enc.fit_transform(data['Ocean_proximity'].values.reshape(-1, 1)).toarray()
# OneHotEncoder expects a 2D array, so the column is reshaped into a column vector;
# fit_transform() returns a sparse matrix, which toarray() converts into an ndarray
encoded_data.shape

In [None]:
encoded_data = encoded_data[outlier_mask]
encoded_data.shape
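
To see what the encoder produces, the sketch below runs the same steps on four hypothetical labels (a made-up sample, not rows from the dataset). Each distinct category gets its own column, and each row contains a single 1.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample of category labels, reshaped into a column vector
toy_labels = np.array(['Inland', 'Near Bay', 'Inland', 'Island']).reshape(-1, 1)

toy_enc = OneHotEncoder()
toy_one_hot = toy_enc.fit_transform(toy_labels).toarray()  # sparse matrix -> ndarray

toy_categories = list(toy_enc.categories_[0])  # column order of the encoded array
print(toy_categories)
print(toy_one_hot)
```

The columns follow the sorted category order, so `toy_enc.categories_` is the place to look when mapping an encoded column back to its label.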

##### Combining the numerical and categorical training data

In [None]:
X = np.concatenate((X_std, encoded_data), axis = 1) #scaled numerical features from 4.5 plus one-hot columns, joined side by side

##### A quick check on the preprocessed training data

In [None]:
pd.DataFrame(X)

### 4.7. Splitting the dataset into the training set and test set

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

## 5. Training the Linear Regression Model with the Training Set

In [None]:
from sklearn.linear_model import LinearRegression
my_model = LinearRegression()
my_model.fit(X_train, y_train)

## 6. Checking the Trained Model with the Test Set

In [None]:
y_pred = my_model.predict(X_test)
from sklearn.metrics import r2_score
score = r2_score(y_test, y_pred)
print('R2 Score = %.3f' %score)
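
As a quick sanity check on what the R2 score measures, the sketch below computes it by hand as one minus the ratio of the residual sum of squares to the total sum of squares, and compares the result with `r2_score`. The target and prediction values are hypothetical.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true values and predictions
toy_true = np.array([3.0, 5.0, 7.0, 9.0])
toy_pred = np.array([2.5, 5.5, 7.0, 8.0])

ss_res = np.sum((toy_true - toy_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((toy_true - toy_true.mean()) ** 2)  # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

r2_sklearn = r2_score(toy_true, toy_pred)
print('Manual R2 = %.3f, sklearn R2 = %.3f' % (r2_manual, r2_sklearn))
```

A score of 1 means the predictions match the targets exactly, while a score of 0 means the model does no better than always predicting the mean of the targets.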

## 7. Exploring Other Methods


In [None]:
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=30)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
rf_score = r2_score(y_test, y_pred)
print('R2 Score = %.3f' %rf_score)