# Importing the necessary dependencies and dataset(s):

In this task I perform supervised machine learning on the given **student_scores.csv** dataset, which records the number of hours each student studied and the score they received in the exam. The goal is to model the relationship between the two.

I am using Linear Regression for this task.

Dependencies used:

- ```sklearn``` for the supervised ML model (Linear Regression)
- ```numpy``` for manipulating data
- ```plotly.express``` for plotting interactive graphs
- ```pandas``` for storing and manipulating data

In [152]:
import pandas as pd
from sklearn.linear_model import LinearRegression
import plotly.express as px
import numpy as np

data = pd.read_csv("./student_scores - student_scores.csv")

# Visualizing the raw data

The next step involves visualizing the raw data to guess the type of algorithm we may require to properly model our data.

In [172]:
features = data.Hours
labels = data.Scores

# Scatter plot of the raw data: hours studied vs. exam score
px.scatter(x=features, y=labels, labels={
    "x": "Number of hours studied",
    "y": "Marks received in exam"
}, title="Student performance chart")
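As a numeric complement to the scatter plot, the Pearson correlation coefficient gives a rough measure of how linear the relationship is. The sketch below is only illustrative: the `hours` and `scores` arrays are hypothetical stand-in values, not the contents of the CSV.

```python
import numpy as np

# Hypothetical stand-in values, NOT the real student_scores.csv data.
hours = np.array([1.5, 2.5, 3.5, 5.0, 6.5, 8.0, 9.0])
scores = np.array([18.0, 25.0, 38.0, 50.0, 66.0, 80.0, 91.0])

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal
# entry is the Pearson correlation between the two arrays.
r = np.corrcoef(hours, scores)[0, 1]
print(f"Pearson r = {r:.3f}")
```

A value close to 1 suggests a strong positive linear relationship. On the real data, the same check could probably be done with `data.Hours.corr(data.Scores)`.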

# Modelling the data

From the scatter plot, we see that the data follows a roughly linear trend (albeit with a few outliers). Hence, a Linear Regression model should give a good but simple fit. We use ```sklearn.linear_model.LinearRegression``` to fit it.

In [133]:
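# Both Series should have the same length (one label per feature value)
features.shape
labels.shape

model = LinearRegression()

# scikit-learn expects a 2-D feature array of shape (n_samples, n_features)
model.fit(features.to_numpy().reshape(-1,1), labels.to_numpy().reshape(-1,1))
predicted = model.predict(features.to_numpy().reshape(-1,1))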

In [134]:
# Flatten the (n, 1) arrays into plain Python lists for printing and plotting
featuresarr = features.to_numpy().reshape(-1,1)
featuresarr = list(i[0] for i in featuresarr.tolist())
predicted = list(i[0] for i in predicted.tolist())
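The reshaping above is needed because scikit-learn estimators expect features as a 2-D array with one row per sample. The target, however, may be 1-D, in which case `predict` also returns a 1-D array and no unwrapping of nested lists is needed. A minimal sketch of that alternative, using hypothetical stand-in values and an illustrative model name `demo`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in values, NOT the real dataset.
hours = np.array([1.0, 2.0, 3.0, 4.0])
scores = np.array([12.0, 22.0, 32.0, 42.0])

X = hours.reshape(-1, 1)                   # shape (4, 1): 4 samples, 1 feature
demo = LinearRegression().fit(X, scores)   # 1-D target is accepted
flat_predictions = demo.predict(X)         # shape (4,), already flat

print(X.shape, flat_predictions.shape)
print(flat_predictions.tolist())
```

This is a design choice rather than a correction: fitting with 2-D labels, as done in this notebook, also works, but then each prediction comes back wrapped in its own one-element row.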

# Making predictions

In [135]:
print("Hours, Predicted score")
for i in range(len(featuresarr)):
    print(f"{featuresarr[i]},      {predicted[i]}")

Hours, Predicted score
2.5,      26.92318188234188
5.1,      52.34027069838929
3.2,      33.76624425589311
8.5,      85.57800222706669
3.5,      36.698985273129345
1.5,      17.147378491554413
9.2,      92.42106460061791
5.5,      56.250592054704285
8.3,      83.62284154890921
2.7,      28.878342560499377
7.7,      77.75735951443671
5.9,      60.16091341101927
4.5,      46.47478866391682
3.3,      34.743824594971855
1.1,      13.237057135239425
8.9,      89.48832358338169
2.5,      26.92318188234188
1.9,      21.057699847869397
6.1,      62.11607408917676
7.4,      74.82461849720048
2.7,      28.878342560499377
4.8,      49.40752968115306
3.8,      39.631726290365584
6.9,      69.93671680180674
7.8,      78.73493985351546


# Plotting the line of best fit

Now that we have the data of the predictions made by the model, we can plot it to visualize the line of best fit.

In [187]:
# Predicted points in red, joined by the fitted line in blue
fig = px.scatter(x=featuresarr, y=predicted, color_discrete_sequence=["red"], labels={
    "x": "Number of hours studied",
    "y": "Marks received in exam"
}, title="Student performance: Line of best fit")
fig.add_trace(px.line(x=featuresarr, y=predicted, color_discrete_sequence=["blue"]).data[0])
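The line of best fit is fully described by a slope and an intercept. Since the CSV itself is not reproduced here, the sketch below refits a line on a few (hours, predicted score) pairs copied from the printed table above. Those pairs lie on the fitted line, so refitting should recover the same coefficients. The names `table_hours`, `table_fitted` and `line` are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# (hours, predicted score) pairs copied from the printed table above.
table_hours = np.array([2.5, 5.1, 3.2, 8.5, 1.5]).reshape(-1, 1)
table_fitted = np.array([
    26.92318188234188,
    52.34027069838929,
    33.76624425589311,
    85.57800222706669,
    17.147378491554413,
])

# Points that lie exactly on a line are reproduced exactly by the fit.
line = LinearRegression().fit(table_hours, table_fitted)
slope = line.coef_[0]
intercept = line.intercept_
print(f"score = {slope:.4f} * hours + {intercept:.4f}")
```

The slope comes out at roughly 9.78 marks per additional hour of study, with an intercept of roughly 2.48 marks. With the original `model` from this notebook, the same numbers should be available as `model.coef_` and `model.intercept_`.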

# Visualizing prediction accuracy

Now that we have found the line of best fit, let's see how the prediction data matches up with the actual data.

In [182]:
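# Sort the (hours, prediction) pairs together so each prediction stays with its input
featuresarr, predicted = (list(t) for t in zip(*sorted(zip(featuresarr, predicted))))
fig = px.scatter(x=featuresarr, y=predicted, color_discrete_sequence=["red"], labels={
    "x": "Number of hours studied",
    "y": "Marks received in exam"
}, title="Student performance: Degree of fit")
# Overlay the actual data points on the predicted points and the fitted line
fig.add_trace(px.scatter(x=features, y=labels).data[0])
fig.add_trace(px.line(x=featuresarr, y=predicted, color_discrete_sequence=["red"]).data[0])
fig.update_layout(showlegend=False)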
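The plot gives a visual impression of the fit. A numeric summary such as the mean absolute error (MAE) or the R² score can complement it. The sketch below shows one way to compute both with `sklearn.metrics`. It runs on randomly generated stand-in data (an assumption, not the real dataset), so the printed numbers say nothing about the actual model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical stand-in data, NOT the real student_scores.csv values.
rng = np.random.default_rng(0)
hours = rng.uniform(1, 9, size=25)
scores = 10 * hours + 2 + rng.normal(0, 5, size=25)

X = hours.reshape(-1, 1)
demo_model = LinearRegression().fit(X, scores)
fitted = demo_model.predict(X)

# MAE: average absolute gap between actual and fitted scores, in marks.
mae = mean_absolute_error(scores, fitted)
# R^2: share of the variance in the scores explained by the line.
r2 = r2_score(scores, fitted)
print(f"MAE: {mae:.2f} marks, R^2: {r2:.3f}")
```

Both metrics here are computed on the same data the model was fitted on, so they describe goodness of fit rather than performance on unseen students.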

# Scenario 1: Predicting the corresponding performance of the student, given hours of study per day.

Now that our model is ready, we can ask it to predict the marks a student might expect, given the number of hours they study per day.
In this case, the student wants to know how many marks they can expect from studying 9.25 hours per day.

In [190]:
# Predict the score for the new input before plotting it
x = np.array([9.25]).reshape(-1,1)
y = model.predict(x)

fig = px.scatter(x=featuresarr, y=predicted, color_discrete_sequence=["red"], labels={
    "x": "Number of hours studied",
    "y": "Marks received in exam"
}, title="Student performance: Prediction of new point (indicated in green)")
fig.add_trace(px.scatter(x=features, y=labels).data[0])
fig.add_trace(px.line(x=featuresarr, y=predicted, color_discrete_sequence=["red"]).data[0])
fig.update_layout(showlegend=False)
# Flatten x and y so the new point is passed as plain one-element lists
fig.add_trace(px.scatter(x=x.ravel().tolist(), y=y.ravel().tolist(), color_discrete_sequence=["green"]).data[0])
print("Expected marks=", y[0][0])
print("\n")
fig

Expected marks= 92.9098547701573
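Before relying on a prediction, it can help to check whether the input lies inside the range of hours the model was fitted on. The sketch below uses the hours values copied from the printed table above. The helper name `is_extrapolation` is hypothetical and not part of any library.

```python
# Hours values copied from the printed table above.
observed_hours = [
    2.5, 5.1, 3.2, 8.5, 3.5, 1.5, 9.2, 5.5, 8.3, 2.7, 7.7, 5.9, 4.5,
    3.3, 1.1, 8.9, 2.5, 1.9, 6.1, 7.4, 2.7, 4.8, 3.8, 6.9, 7.8,
]

def is_extrapolation(hours, observed):
    """Return True when `hours` lies outside the observed range."""
    return hours < min(observed) or hours > max(observed)

print(is_extrapolation(9.25, observed_hours))  # 9.25 is above the maximum of 9.2
print(is_extrapolation(5.0, observed_hours))   # 5.0 is inside the observed range
```

Here 9.25 hours is only slightly above the largest observed value, so the prediction is a mild extrapolation of the fitted line.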




# Conclusion:

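In conclusion, given the current data, a student who studies 9.25 hours per day may expect a score of around 93 marks. Note that 9.25 hours is slightly above the largest value in the data (9.2 hours), so this prediction is a mild extrapolation of the fitted line.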