# Case Study

This notebook is designed to teach you about some Data Science tools. It will cover these core concepts:
- Explore and understand
- Preprocess
- Management
- Analysis
- Visualization

***
## Explore and Understand
***
We will perform an Exploratory Data Analysis (EDA). This will show us what the data looks like, how large it is, which data types are present, whether any values are missing, and how the columns correlate with one another.


This is useful because it tells us whether the data we are working with is clean. If it is not, we will have to fix it before we can use it.

### Importing Libraries and Data Analysis

In [None]:
# These are the libraries we will be using
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


<br><br>
We will now read in the CSV dataset using the pandas library and display the first 5 rows to see how the data looks. Pandas is a software library for data manipulation and analysis (https://pandas.pydata.org/docs/).

In [None]:
# Read dataset 
dataframe = pd.read_csv("Octanol_Water.csv")

# This displays the top 5 rows of our data
dataframe.head()

<br><br>
We will show the shape of our dataset. This tells us how many rows and columns there are in the form (rows, columns).

In [None]:
# Size of data
dataframe.shape

<br><br>
We will now get the headings from all the columns.

In [None]:
# Column names (features)
dataframe.columns

<br><br>
We can now get the data type of each column and the number of non-null values it contains.

In [None]:
dataframe.info()

<br><br> 
As we can see, two data types are present (int64 and float64). Each column has 16 non-null values, which matches the 16 rows reported by the shape, so no data is missing.

Just to be sure, we can see how many null values are present:

In [None]:
# Number of all missing datapoints
dataframe.isna().sum()

The data above has no missing values. If it did, there are a few ways to handle them, which are covered in the Preprocessing section further down in the notebook.

<br><br>
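We will now see how correlated the data is. The plot below shows how strongly each variable (column) is correlated with every other column. With the coolwarm colormap, the strongest positive correlations appear dark red and the lowest (most negative) values appear dark blue; pale colors mark values near the middle of the color bar.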

In [None]:
sns.heatmap(dataframe.corr(), cmap='coolwarm')
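
To make the numbers behind the colors concrete, here is a minimal sketch that computes a correlation matrix for a small made-up table. The column names and values are hypothetical and are not taken from Octanol_Water.csv; they are chosen so that one pair of columns is perfectly positively correlated and another is perfectly negatively correlated.

```python
import pandas as pd

# Hypothetical toy data, NOT from Octanol_Water.csv
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],  # rises exactly with "a"
    "c": [5.0, 4.0, 3.0, 2.0, 1.0],   # falls exactly as "a" rises
})

# Pearson correlation between every pair of columns (values lie in [-1, 1])
corr = toy.corr()
print(corr)
```

When plotting, passing annot=True, vmin=-1 and vmax=1 to sns.heatmap prints each value in its cell and pins the color scale to the full correlation range, which makes the colors easier to compare between datasets.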

***
## Preprocessing
***
This is where we handle the unclean data. We need to fix any missing values so that we can perform linear regression on the data.

This is also where we can delete any columns we don't need for our calculations.

### Handling the Missing Data
Missing data in the training set can reduce the power or fit of a model, or bias it, because the behavior of a variable and its relationship with other variables cannot be analysed correctly. This can lead to wrong predictions or classifications.

Common methods to treat missing values:

1) Deletion

2) Mean / Mode / Median / Zero Imputation
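
The two approaches listed above can be sketched side by side on a small made-up table. The column names and values below are hypothetical and only illustrate how deletion and imputation differ.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in column "x"
toy = pd.DataFrame({"x": [1.0, 2.0, np.nan, 4.0],
                    "y": [10.0, 20.0, 30.0, 40.0]})

# 1) Deletion: drop every row that contains a missing value
dropped = toy.dropna()

# 2) Imputation: fill the gap with a column statistic or a constant
mean_filled = toy.fillna(toy.mean())      # column mean
median_filled = toy.fillna(toy.median())  # column median
zero_filled = toy.fillna(0)               # constant zero

print(len(dropped))
print(mean_filled.loc[2, "x"], median_filled.loc[2, "x"], zero_filled.loc[2, "x"])
```

Deletion keeps only observed values but shrinks the dataset, which matters when there are few rows. Imputation keeps every row but introduces values that were never measured.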

<br><br>
In the case of this dataset we can just get rid of the rows that include missing values. (We know the data has no missing elements, but if it did, this is how we would do it).

In [None]:
# Drop the rows that have missing values
# dataframe.fillna(0, inplace=True)  # fills missing elements with 0's
dataframe.dropna(inplace=True)  # gets rid of rows that have missing data

<br><br> 
Now we can double check that there are no missing values.

In [None]:
# Number of all missing datapoints
dataframe.isna().sum()

<br>
Great! No missing values. Now we can move on.<br>

<br><br>
To perform our Linear Regression, we only need to focus on the following columns:
- log K_ow
- BCF

This means we can disregard the rest of the columns when performing our Linear Regression.

<br><br>
Now that we know what to focus on, we will begin the process for Linear Regression. 

***
## Management and Analysis
***
In this section we will run our linear regression algorithms on the data.

We will first look at the input and target variables: log K_ow is the input (X) and BCF is the target (Y).

In [None]:
# Input
X = pd.DataFrame(dataframe['log K_ow'])
# Target / output
Y = pd.DataFrame(dataframe['BCF'])

In [None]:
X.head()

In [None]:
Y.head()

<br><br>
We need to split the data into two sets. One set will be used to train the linear regression model, while the other set will be used to test the linear regression model.

In [None]:
# We need to split the data into a training and testing set
# This library will give us an easy way to split our data
from sklearn.model_selection import train_test_split

# Split the Input and Target variables into TRAINING and TESTING data
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.3, random_state=1)
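
As a rough illustration of what this split produces, the sketch below applies the same call to a hypothetical 16-row stand-in (made-up values, not the real dataset) and prints the resulting sizes. Setting random_state fixes the shuffle, so the same rows land in each set on every run.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for a 16-row dataset
X_toy = np.arange(16, dtype=float).reshape(-1, 1)
y_toy = 2.0 * X_toy.ravel() + 1.0

# Same settings as the notebook cell: 30% test, fixed seed
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=1
)

print(X_tr.shape, X_te.shape)
```

With 16 rows and test_size=0.3, the test set is rounded up to 5 rows and the remaining 11 are used for training.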

<br><br>
***
## Your Turn
### Now we have our data in a format we can use for Linear Regression
You can attempt to use the algorithms developed in the previous notebook. Please replace the ___ with the correct variables or functions.

In [None]:
# This is the data we are using
x_data = X_train.to_numpy().flatten()
y_data = y_train.to_numpy().flatten()

# Least-squares fit via the normal equations: A @ [m, b] = y
# Right-hand side: [sum(y), sum(x*y)]
y = np.array([np.sum(y_data), np.sum(y_data * x_data)])

# Coefficient matrix: [[sum(x), n], [sum(x^2), sum(x)]]
A = np.array([[np.sum(x_data), len(x_data)],
              [np.sum(x_data * x_data), np.sum(x_data)]], dtype=float)

# Solve for [m, b] (slope first, intercept second)
m_b = np.linalg.solve(A, y)

# Display the results
print("\nRESULTS")
print("Intercept (b):\t", m_b[1])
print("Slope (m):\t", m_b[0])
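
One way to sanity-check the cell above is to run the same normal-equation layout on made-up data and compare it with an independent fit. The sketch below uses hypothetical points near the line y = 2x + 1 and np.polyfit as the reference; it is a cross-check, not part of the original exercise.

```python
import numpy as np

# Hypothetical toy data lying near the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_obs = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Normal equations in the same layout as the notebook cell:
#   sum(y)   = m * sum(x)   + b * n
#   sum(x*y) = m * sum(x^2) + b * sum(x)
rhs = np.array([y_obs.sum(), (x * y_obs).sum()])
A = np.array([[x.sum(), len(x)],
              [(x * x).sum(), x.sum()]])
m, b = np.linalg.solve(A, rhs)

# Independent check: a degree-1 polynomial fit returns [slope, intercept]
m_ref, b_ref = np.polyfit(x, y_obs, 1)

print(m, b)
print(m_ref, b_ref)
```

Both approaches should agree to within floating-point error. If they do not, the usual culprit is a swapped row or column in A.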

<br><br>
***
Now you can perform linear regression using the Python technique that you learned in the previous notebook.
The data variables to use are:
- X_train
- y_train
- X_test

In [None]:
# Performing linear regression in Python
from sklearn.linear_model import LinearRegression

# Create our linear regression model
model = LinearRegression()

# Next we fit our model using the training data
model.fit(X_train, y_train)

# Now we apply our model to the testing data
predictions = model.predict(X_test)

In [None]:
# Get the intercept and slope (see linear regression notebook)
b = model.intercept_
m = model.coef_

# Get the results
print("\nRESULTS")
print("Intercept (b): ", b)
print("Slope (m): ", m)
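
Because the target is passed to the model as a one-column DataFrame, scikit-learn returns the intercept as a 1-element array and the slope as a 1x1 array. The sketch below uses hypothetical data (the names X_toy and Y_toy and their values are made up) to show how to pull out the scalars and confirm that each prediction equals m * x + b.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data, shaped like the notebook's X and Y DataFrames
X_toy = pd.DataFrame({"x": [0.0, 1.0, 2.0, 3.0, 4.0]})
Y_toy = pd.DataFrame({"y": [1.1, 2.9, 5.2, 6.8, 9.1]})

model_toy = LinearRegression().fit(X_toy, Y_toy)

# Extract plain floats from the array-valued attributes
b_toy = float(model_toy.intercept_[0])
m_toy = float(model_toy.coef_[0, 0])

# Predictions from the model versus the line equation written out by hand
pred = model_toy.predict(X_toy).ravel()
manual = m_toy * X_toy["x"].to_numpy() + b_toy

print(m_toy, b_toy)
print(np.allclose(pred, manual))
```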

In [None]:
# Let's compute the R-squared score (1.0 is a perfect fit)
from sklearn.metrics import r2_score

print(r2_score(y_test, predictions))

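This shows our R-squared value, which measures how much of the variation in BCF in the test set the model explains. A value of 1 is a perfect fit; values near 0 (or below) mean the model does no better than always predicting the mean.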
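
R-squared is 1 minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch of that calculation on hypothetical observed and predicted values (not results from this dataset):

```python
import numpy as np

# Hypothetical observed and predicted values
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])

ss_res = np.sum((y_true - y_pred) ** 2)         # variation the model leaves unexplained
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
r2 = 1.0 - ss_res / ss_tot

print(r2)
```

Here the residuals are small compared with the spread of the observed values, so the score is close to 1. On a test set of only a few points, a single poor prediction can pull the score down sharply.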

***
## Visualization
***
Now we can plot the data and see if it looks correct.


In [None]:
# plot the results
plt.figure(figsize=(8, 6))
ax = plt.axes()
ax.scatter(X_test, y_test, color='black')
ax.plot(X_test, predictions, color='blue', linewidth=3)

ax.set_xlabel('log K_ow')
ax.set_ylabel('BCF')

ax.axis('tight')

plt.show()

<br><br>
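As we can see, our results are not great. With only 16 rows, the model is trained on 11 points and tested on 5, so both the fit and the R-squared score are sensitive to which rows land in each set. More data would likely improve the results.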