<a href="https://colab.research.google.com/github/Vincenzo-Miracula/Zayed-University/blob/main/Linear_Logistic_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Inferential Statistics
Inferential statistics uses analysis of a sample to infer properties of the wider population and to make predictions that descriptive statistics alone cannot provide. Examples include predicting a new, unknown value from previous data (linear regression, machine learning) and hypothesis testing (such as t-tests).
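
As a small illustration of hypothesis testing, here is a minimal sketch of a two-sample t-test. The two groups are synthetic data invented for this example (they are not part of the lab), and Welch's variant (`equal_var=False`) is just one reasonable choice.

```python
import numpy as np
from scipy import stats

# Synthetic samples (an assumption for illustration): two groups whose
# true means differ by 3 units.
rng = np.random.default_rng(5)
group_a = rng.normal(loc=22.0, scale=3.0, size=60)
group_b = rng.normal(loc=25.0, scale=3.0, size=60)

# Welch's t-test: does not assume the two groups share the same variance.
result = stats.ttest_ind(group_a, group_b, equal_var=False)

print("t statistic:", round(result.statistic, 3))
print("p-value:", result.pvalue)
```

A small p-value suggests that the difference in sample means is unlikely to be due to chance alone, which is the kind of inference about a population that descriptive statistics cannot make.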

# Regression in Python

***
This is a very quick run-through of some basic statistical concepts, adapted from [Lab 4 in Harvard's CS109](https://github.com/cs109/2015lab4) course. Please feel free to try the original lab if you're feeling ambitious :-) The CS109 git repository also has the solutions if you're stuck.

* Linear Regression Models
* Prediction using linear regression
* Some re-sampling methods    
    * Train-Test splits
    * Cross Validation

Linear regression is used to model and predict continuous outcomes, while logistic regression is used to model categorical (most commonly binary) outcomes. We'll see examples of both, along with train-test splits.
***
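
The two re-sampling methods listed above can be sketched side by side. This is a minimal example on synthetic data (the data and the names `X_demo` and `y_demo` are assumptions made for illustration), comparing a single train-test split with 5-fold cross-validation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data (an assumption for illustration): y = 3x + 2 plus noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = 3.0 * X_demo[:, 0] + 2.0 + rng.normal(0, 1.0, size=200)

# Train-test split: fit on 70% of the rows, score on the held-out 30%.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)
holdout_r2 = LinearRegression().fit(X_tr, y_tr).score(X_te, y_te)

# 5-fold cross-validation: each fold serves once as the test set.
cv_r2 = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5, scoring="r2")

print("hold-out R^2:", round(holdout_r2, 3))
print("per-fold R^2:", np.round(cv_r2, 3))
print("mean CV R^2:", round(cv_r2.mean(), 3))
```

A single split gives one estimate that depends on which rows happen to land in the test set; cross-validation averages over several splits, at the cost of fitting the model several times.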

***
# Part 1: Linear Regression
### Purpose of linear regression

<p> Given a dataset $X$ and $Y$, linear regression can be used to: </p>
<ul>
  <li> Build a <b>predictive model</b> to predict the value of $Y$ for new observations $X_i$ whose $Y$ value is unknown.  </li>
  <li> Model the <b>strength of the relationship</b> between each independent variable $X_i$ and $Y$</li>
    <ul>
      <li> Sometimes not all $X_i$ will have a relationship with $Y$</li>
      <li> Need to figure out which $X_i$ contributes most information to determine $Y$ </li>
    </ul>
   <li>Linear regression is used in so many applications that we won't try to list examples here. In many cases it is the first-pass prediction algorithm for continuous outcomes. </li>
</ul>
</div>
***
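
To make the second purpose concrete, here is a minimal sketch of how fitted coefficients can hint at which $X_i$ carry information about $Y$. The three features are synthetic and deliberately drawn on the same scale (an assumption made for illustration), so their coefficients are directly comparable; with real data on different scales, that comparison would need more care.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic features (an assumption for illustration), all on the same scale.
rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)  # strongly related to y
x2 = rng.normal(size=n)  # weakly related to y
x3 = rng.normal(size=n)  # unrelated to y
X_demo = np.column_stack([x1, x2, x3])
y_demo = 4.0 * x1 + 0.5 * x2 + rng.normal(scale=0.5, size=n)

model = LinearRegression().fit(X_demo, y_demo)

# Coefficients near zero suggest a feature adds little to the prediction.
for name, coef in zip(["x1", "x2", "x3"], model.coef_):
    print(f"{name}: {coef:+.2f}")
```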

In [103]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score,classification_report
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

Boston House Prices dataset
===========================

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

In [2]:
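# A raw string avoids an invalid-escape warning for the regex separator; the 11th column is PTRATIO (pupil-teacher ratio).
boston = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data", sep=r'\s+', names=["CRIM","ZN","INDUS","CHAS","NOX","RM","AGE","DIS","RAD","TAX","PTRATIO","B","LSTAT","MEDV"])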

In [None]:
boston.head()

In [None]:
boston.info()

In [None]:
boston.isna().sum()

In [None]:
boston.describe().T

In [None]:
correlation_matrix = boston.corr()

plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.title("Correlation Matrix of Variables in the Boston Dataset")
plt.show()

In [None]:
boston.plot(x="RM", y="MEDV", style="o")
plt.xlabel("Average number of rooms per dwelling(RM)")
plt.ylabel("Housing Price")
plt.title("Relationship between RM and Price")
plt.show()

In [None]:
sns.regplot(y="MEDV", x="RM", data=boston, fit_reg = True)

In [None]:
boston.plot(x="RM", y="CRIM", style="o")
plt.xlabel("Average number of rooms per dwelling(RM)")
plt.ylabel("Per capita crime rate (CRIM)")
plt.title("Relationship between RM and CRIM")
plt.show()

In [None]:
sns.regplot(y="CRIM", x="RM", data=boston, fit_reg = True)

In [None]:
plt.hist(boston['RM'])
plt.title("RM")
plt.xlabel("Average number of rooms per dwelling")
plt.ylabel("Frequency")
plt.show()

In [None]:
plt.hist(boston["MEDV"])
plt.title("MEDV")
plt.xlabel("Median value of owner-occupied homes ($1000s)")
plt.ylabel("Frequency")
plt.show()

## Linear regression with Boston housing data example
***

Here,

$Y$ = Boston housing prices (the `MEDV` column, also called the "target" in scikit-learn)

and

$X$ = the features (or independent variables); to keep things simple, the code below starts with a single feature, `RM`,

which we will use to fit a linear regression model and predict Boston housing prices. We will use the least squares method to estimate the coefficients.
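
For a single feature, least squares chooses the intercept $\beta_0$ and slope $\beta_1$ that minimize $\sum_i (y_i - \beta_0 - \beta_1 x_i)^2$. The sketch below solves that problem directly with NumPy and checks that scikit-learn arrives at the same numbers. The data is synthetic and the names `x_demo`, `y_demo` and `lm_demo` are made up for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): a noisy straight line.
rng = np.random.default_rng(2)
x_demo = rng.uniform(3, 9, size=100)
y_demo = 9.0 * x_demo - 35.0 + rng.normal(scale=2.0, size=100)

# Design matrix with a column of ones so the first coefficient is the intercept.
A = np.column_stack([np.ones_like(x_demo), x_demo])
beta, *_ = np.linalg.lstsq(A, y_demo, rcond=None)

# The same fit through scikit-learn, which expects a 2-D feature array.
lm_demo = LinearRegression().fit(x_demo.reshape(-1, 1), y_demo)

print("lstsq   intercept, slope:", beta)
print("sklearn intercept, slope:", lm_demo.intercept_, lm_demo.coef_[0])
```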

In [110]:
X = boston[['RM']] #feature
Y = boston["MEDV"] #target

In [111]:
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=0.3, random_state=0)

In [None]:
lm = LinearRegression()
lm.fit(X_train, Y_train)

#### What can you do with a LinearRegression object?
***
Check out the scikit-learn [docs here](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html). We have listed the main functions here.

Main functions | Description
--- | ---
`lm.fit()` | Fit a linear model
`lm.predict()` | Predict Y using the linear model with estimated coefficients
`lm.score()` | Returns the coefficient of determination ($R^2$). *A measure of how well observed outcomes are replicated by the model, as the proportion of total variation of outcomes explained by the model*
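
The description of `score()` in the table can be checked by hand: $R^2 = 1 - SS_{res}/SS_{tot}$, where $SS_{res}$ is the sum of squared residuals and $SS_{tot}$ is the total variation of the outcomes around their mean. A minimal sketch on synthetic data (the data and variable names are assumptions made for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(3)
X_demo = rng.uniform(0, 5, size=(80, 1))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=1.0, size=80)

model = LinearRegression().fit(X_demo, y_demo)
y_hat = model.predict(X_demo)

# Proportion of total variation explained by the model.
ss_res = np.sum((y_demo - y_hat) ** 2)           # unexplained variation
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)   # total variation
r2_manual = 1.0 - ss_res / ss_tot

print("manual R^2:   ", r2_manual)
print("model.score():", model.score(X_demo, y_demo))
print("r2_score():   ", r2_score(y_demo, y_hat))
```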

In [113]:
Y_test_pred = lm.predict(X_test)

In [None]:
R2 = r2_score(Y_test, Y_test_pred)
R2

In [None]:
print("coefficient of determination:", lm.score(X, Y))  # R^2 on the full dataset (train + test rows)
print("intercept:", lm.intercept_)
print("coefficients:", lm.coef_)

In [None]:
plt.hist(lm.predict(X))
plt.title("PREDICTION PRICE")
plt.xlabel("Predicted Price")
plt.ylabel("Frequency")
plt.show()

In [None]:
sns.regplot(x=lm.predict(X), y=boston["MEDV"], data=boston, fit_reg=True)

In [None]:
mse = mean_squared_error(Y_test, Y_test_pred)
print("Mean Squared Error:", mse)

## Logistic Regression
Logistic regression models the relationship between a binary output and one (or more) input variables. With one input and one output variable, you're essentially finding the best-fitting S-shaped curve for the probability of the positive class. Instead of calculating a slope and y-intercept for a line by least squares, you determine the coefficients that make the observed outcomes most probable under the model, typically through maximum likelihood estimation. Once trained, the model can classify new data points based on their input variables, assigning each to one of two categories. scikit-learn's `LogisticRegression` also handles more than two classes, which is what the three-species iris example below relies on.
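
To show what the fitted coefficients mean in the binary, one-input case, here is a minimal sketch that passes the linear combination of intercept and coefficient through the sigmoid function and compares the result with `predict_proba`. The data is synthetic, and `sigmoid`, `x_demo`, `y_demo` and `clf` are names made up for this illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic binary data (an assumption for illustration): the chance of
# class 1 rises with x along an S-shaped curve.
rng = np.random.default_rng(4)
x_demo = rng.normal(size=300)
p_true = sigmoid(2.0 * x_demo - 0.5)
y_demo = (rng.uniform(size=300) < p_true).astype(int)

clf = LogisticRegression().fit(x_demo.reshape(-1, 1), y_demo)

# Probability of class 1 for three new points, computed by hand.
x_new = np.array([[-2.0], [0.0], [2.0]])
z = clf.intercept_[0] + clf.coef_[0, 0] * x_new[:, 0]
p_manual = sigmoid(z)

p_sklearn = clf.predict_proba(x_new)[:, 1]  # column 1 holds P(class 1)
labels = clf.predict(x_new)                 # class 1 when P(class 1) > 0.5

print("manual probabilities: ", np.round(p_manual, 3))
print("sklearn probabilities:", np.round(p_sklearn, 3))
print("predicted labels:     ", labels)
```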

In [76]:
from sklearn import datasets

In [77]:
iris = datasets.load_iris()

In [81]:
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
iris_df['species'] = iris.target_names[iris.target]

In [None]:
iris_df.head()

In [None]:
iris_df.shape

In [None]:
iris_df.info()

In [None]:
iris_df.isna().sum()

In [None]:
iris_df.describe().T

In [None]:
iris_df['species'].value_counts()

In [None]:
custom_palette = {'setosa': 'blue', 'versicolor': 'green', 'virginica': 'orange'}
sns.set_style('whitegrid')
sns.FacetGrid(iris_df, hue='species', palette=custom_palette, height=5, aspect=1.5) \
    .map(plt.scatter, 'sepal length (cm)', 'sepal width (cm)') \
    .add_legend()
plt.title("Iris Flower Sepal Length and Sepal Width")
plt.show()

In [None]:
sns.set_style('whitegrid')
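# `size` was renamed to `height` in newer seaborn releases
g = sns.pairplot(iris_df, hue='species', palette=custom_palette, height=4)
# pairplot draws a grid of axes, so put the title on the whole figure
g.fig.suptitle("Iris Flower Features", y=1.02)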
plt.show()

In [92]:
labels = {'setosa' : 0,'versicolor' : 1,'virginica' : 2}
iris_df['species'] = iris_df['species'].map(labels)  # encode species names as integers (run this cell only once)

In [95]:
X = iris_df.drop(columns=['species']) #features
y = iris_df['species'] #target variable

In [96]:
X_train, X_test, Y_train, Y_test = train_test_split(X,y, test_size=0.3, random_state=0)

In [None]:
log_reg = LogisticRegression(max_iter=200)  # extra iterations give the solver room to converge on unscaled features
log_reg.fit(X_train, Y_train)

In [98]:
Y_test_pred = log_reg.predict(X_test)

In [None]:
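# Note: R^2 treats the class labels 0/1/2 as numbers, so it is not a meaningful
# classification metric; prefer the accuracy and classification report below.
R2 = r2_score(Y_test, Y_test_pred)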
R2

In [None]:
print("mean accuracy:", log_reg.score(X, y))  # for classifiers, score() returns accuracy, not R^2
print("intercept:", log_reg.intercept_)
print("coefficients:", log_reg.coef_)

In [None]:
cr = classification_report(Y_test, Y_test_pred)  # argument order is (y_true, y_pred)
print(cr)

In [None]:
a_s = accuracy_score(Y_test, Y_test_pred)  # argument order is (y_true, y_pred)
print("Accuracy score of Logistic Regression: %.2f%%" % (a_s * 100))

In [None]:
# Note: MSE treats the class labels as numbers; it is of limited use for classification.
mse = mean_squared_error(Y_test, Y_test_pred)
print("Mean Squared Error:", mse)