<a href="https://colab.research.google.com/github/ddaviddn/4-beginner-dudes/blob/master/Full_Simple_Linear_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# An Introduction to Simple Linear Regression in Python
**For this first statistical Python project, we run a full simple linear regression analysis using the imported libraries and modules below, along with some basic Python concepts.**

(Using the amazing sklearn: A Machine Learning Library)

**Importing the necessary libraries and modules**

Why do we need these specific libraries/modules?



*   **NumPy** - A popular and essential library for working with numerical data. It simplifies work with multi-dimensional arrays and matrices and makes operations on them a breeze.

*   **pandas** - Another popular and essential library for dealing with data. It builds on NumPy and offers many data manipulation and data analysis tools.

*   **Matplotlib** - A comprehensive library for creating static, animated, and interactive plots. It is well known for turning simple syntax into full, digestible data visualizations.

*   **sklearn (scikit-learn)** - A machine learning library built on NumPy and SciPy that implements classical machine learning algorithms, including regression, classification, and clustering.

*   **Seaborn** - A data visualization library built on Matplotlib. It provides a high-level interface for creating more aesthetically pleasing and informative statistical graphics. We use it passively here: we never call its plotting functions and only apply its styling to our Matplotlib visualizations.



In [0]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import f_regression

# Optional: seaborn is used here only to restyle the Matplotlib plots
import seaborn as sns
sns.set()

**Creating the simple linear regression function with the parameters**

Using a mixture of the libraries and modules imported above, we create a function that prints everything needed for the analysis: the R-squared and adjusted R-squared values, the fitted equation, the p-value, and a scatter plot with the regression line.



In [0]:
def simple_regression(path,x_name,y_name):

  data = pd.read_csv(path)

  print(data.head())

  x = data[x_name]
  y = data[y_name]

  x_matrix = x.values.reshape(-1,1)
  reg = LinearRegression()
  reg.fit(x_matrix,y)

  r2 = reg.score(x_matrix,y)

  print("\nThe R-squared value is " + str(round(r2,3)) +".\n")

  if r2 >= 0.8:
    print("Since our R-squared value is high, "+str(x_name)+
          " is a good predictor of "+ str(y_name)+".")
  elif r2 >= 0.5:
    print("Since our R-squared value is moderate, "+str(x_name)+
          " is a decent predictor of "+ str(y_name)+
          ". Further analysis is needed. Maybe try different variables.")
  else:
    print("Since our R-squared value is low, "+str(x_name)+
          " is not the best predictor of "+ str(y_name)+
          ". Further analysis is needed. Maybe try different variables.")

  # Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1),
  # where n is the number of rows and p is the number of features
  adj = 1-(1-r2)*(x_matrix.shape[0]-1)/(x_matrix.shape[0]-x_matrix.shape[1]-1)

  print("\nThe adjusted R-squared value is " + str(round(adj,3)) +".\n")

  coef = reg.coef_
  intercept = reg.intercept_

  yhat = intercept + coef*x_matrix

  # coef is a one-element array; use coef[0] so the equation prints a plain number
  print('y = '+str(round(intercept,3))+' + '+str(round(coef[0],3))+' * ' + x_name)
 

  p = f_regression(x_matrix,y)[1].round(5)

  # print('\nThe p-value for the ' + x_name +' variable is '+ str(p))

  reg_summary = pd.DataFrame(data=[x_name],columns = ['Features'])
  reg_summary ['Coefficients'] = coef
  reg_summary ['P-Values'] = p

  print("\n" + str(reg_summary))
  print("\n")
  plt.scatter(x,y, alpha = 0.8)
  plt.xlabel(x_name, fontsize = 20)
  plt.ylabel(y_name, fontsize = 20)
  fig = plt.plot(x, yhat, c='green', label = 'Regression')
  plt.show()

  return 'Done'
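
The adjusted R-squared line inside `simple_regression` can be checked in isolation. Below is a minimal sketch, assuming a small made-up data set (the numbers are illustrative and do not come from `grades.csv`) and a hypothetical helper named `adjusted_r2` that is not part of the notebook. It also shows why the function calls `reshape(-1,1)`: scikit-learn expects the features as a 2-D array of shape `(n_samples, n_features)`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def adjusted_r2(r2, n_samples, n_features):
    """Hypothetical helper: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Made-up illustrative data, not taken from grades.csv
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1, 6.2])

# A 1-D array of shape (6,) becomes a 2-D column of shape (6, 1)
x_matrix = x.reshape(-1, 1)

reg = LinearRegression().fit(x_matrix, y)
r2 = reg.score(x_matrix, y)

# x_matrix.shape unpacks to (n_samples, n_features)
adj = adjusted_r2(r2, *x_matrix.shape)

print("R-squared:", round(r2, 3))
print("Adjusted R-squared:", round(adj, 3))
```

With a single feature the adjustment is small, and the adjusted value is never larger than the plain R-squared. The penalty matters more once several features are involved.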

**Creating another function for future estimations using our model**

This function takes one or more SAT scores and returns the estimated GPA for each student. For readability, the output is a table of values.


In [0]:
def prediction(path, values):

  data = pd.read_csv(path)

  # Every column except the last is treated as a feature; the last column is the target
  x = data.iloc[:,:-1].values
  y = data.iloc[:,-1].values

  reg = LinearRegression()
  reg.fit(x,y)

  intercept = reg.intercept_
  coef = reg.coef_

  a = intercept + coef*values

  examples = pd.DataFrame({'X1':values})
  examples = examples[['X1']]

  pred = pd.DataFrame({"Constant":'True',"Predicted value":a}) 

  full = examples.join(pred)

  return full
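
The manual `intercept + coef*values` step works here because the model has a single feature. As a hedged alternative, the sketch below uses `LinearRegression.predict`, which expects a 2-D array of shape `(n_samples, n_features)`, and confirms that both routes give the same numbers. The data frame and its column names `x` and `y` are made up for illustration and stand in for a two-column CSV.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up illustrative data standing in for a two-column CSV (feature, target)
data = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0],
                     "y": [2.1, 3.9, 6.2, 7.8, 10.1]})

x = data.iloc[:, :-1].values  # all columns but the last, shape (5, 1)
y = data.iloc[:, -1].values   # last column, shape (5,)

reg = LinearRegression().fit(x, y)

new_values = np.array([6.0, 7.0])

# Route 1: apply the fitted equation by hand (single-feature case)
manual = reg.intercept_ + reg.coef_[0] * new_values

# Route 2: let the model do it; predict() needs a 2-D input
via_predict = reg.predict(new_values.reshape(-1, 1))

result = pd.DataFrame({"X1": new_values, "Predicted value": via_predict})
print(result)
```

Using `predict` keeps working unchanged if more feature columns are added later, whereas the hand-written equation would need to be rewritten.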

# REAL EXAMPLE:

# After we've created our personalized regression functions, let's try to use our functions with a real data set



**Data Description**


> This is a simple and intuitive data set with two variables, 'SAT' and 'GPA', for high school students. It contains 84 observations: the GPAs of randomly selected students and their corresponding SAT scores. This is enough data to drive the point home.

  For example, the first row shows that SAT = 1714 and GPA = 2.40. In other words, the first student scored 1714 on the SAT and earned a GPA of 2.40.


In [0]:
data = pd.read_csv('grades.csv')

print(data.describe())
data

# Finally applying our simple linear regression function that we have created.

The parameters are the ones we defined for the simple regression function: the name of the file, the independent variable, and the dependent variable.

If the parameters are entered correctly, the output is a full simple regression analysis along with a scatter plot of the data and the fitted regression line.

In [0]:
simple_regression('grades.csv','SAT','GPA')
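
One way to sanity-check the slope, intercept, R-squared, and p-value that `simple_regression` reports is to compare them against `scipy.stats.linregress`. The sketch below does this on synthetic data, where the seed, sample size, and true coefficients are made up for illustration and are not related to `grades.csv`. With a single feature, the p-value from `f_regression` should agree with the two-sided p-value for the slope from `linregress`.

```python
import numpy as np
from scipy import stats
from sklearn.feature_selection import f_regression
from sklearn.linear_model import LinearRegression

# Synthetic data: y depends linearly on x plus noise (all values made up)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 0.5 * x + rng.normal(0, 2, size=50)

# Same steps as simple_regression: reshape, fit, score, F-test p-value
x_matrix = x.reshape(-1, 1)
reg = LinearRegression().fit(x_matrix, y)
r2 = reg.score(x_matrix, y)
p_sklearn = f_regression(x_matrix, y)[1][0]

# Independent cross-check with SciPy
res = stats.linregress(x, y)

print("slope:    ", reg.coef_[0], "vs", res.slope)
print("intercept:", reg.intercept_, "vs", res.intercept)
print("R-squared:", r2, "vs", res.rvalue ** 2)
print("p-value:  ", p_sklearn, "vs", res.pvalue)
```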


# Predicting new values using our prediction function

After running the analysis, a natural next step is to predict new values with our regression model. Let's predict the GPA for an array of SAT scores belonging to four imaginary high school students.

In [0]:
predict_SAT = np.array([1500,1600,1700,1800])

prediction('grades.csv', predict_SAT)

|   | X1 | Constant | Predicted value |
|---|------|----------|-----------------|
| 0 | 1500 | True | 2.758572 |
| 1 | 1600 | True | 2.924141 |
| 2 | 1700 | True | 3.089710 |
| 3 | 1800 | True | 3.255279 |
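
The predicted values above lie on a straight line, so the fitted equation can be recovered from the output itself. A small sketch of that arithmetic, using only the four SAT inputs and the four predictions shown; because the displayed predictions are rounded to six decimals, the recovered slope and intercept are approximations rather than the exact fitted coefficients.

```python
import numpy as np

# Values copied from the prediction output above
sat = np.array([1500, 1600, 1700, 1800])
pred = np.array([2.758572, 2.924141, 3.08971, 3.255279])

# Slope = change in predicted GPA per one SAT point
slopes = np.diff(pred) / np.diff(sat)
slope_est = slopes.mean()

# Intercept = prediction minus slope times input, using the first row
intercept_est = pred[0] - slope_est * sat[0]

print("Predicted GPA gain per 100 SAT points:", round(slope_est * 100, 6))
print("Approximate slope:", slope_est)
print("Approximate intercept:", round(intercept_est, 6))
```

Each additional 100 SAT points raises the predicted GPA by about 0.166 under this model.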


# Hopefully you enjoyed this walkthrough of simple linear regression in Python :) 

Give yourself a pat on the back, you deserve it. This is a good first project for dipping your toes into machine learning and statistical analysis with Python. Feel free to save a copy and edit the code yourself :)


By: David Nguyen

![Congratulations!](https://contenthub-static.grammarly.com/blog/wp-content/uploads/2019/04/thumbnail-7075f02d50b2e1b87acaac02e0592003.jpeg)