# Python: Scikit-Learn 

## Course Skip Quiz

The following questions test your understanding of the content covered in this course. There is no set score below which we think you should attend. Rather, the quiz is intended to make you engage with the content and judge for yourself whether you would benefit from attending. Note that you are expected to use Google and consult documentation throughout. Even if you get every question right, you are of course more than welcome to attend the course and use it as a refresher!

In [None]:
from jupyterquiz import display_quiz
display_quiz("../questions/course_skip_quizes/scikitlearn_course_skip_quiz.json")

## Course

In [None]:
# import the packages used before and read in the required data
import pandas as pd
import matplotlib.pyplot as plt
air_pollution_data_2023_complete_dataset = pd.read_csv("../data/LEED_air_pollution_monitoring_station_2023_complete_dataset.csv", index_col=0)
air_pollution_data_2023_complete_dataset = air_pollution_data_2023_complete_dataset.dropna()

In [None]:
# scikit-learn can be imported with the following command
import sklearn

## What is Scikit-Learn?

Scikit-Learn is a popular Python package that provides a set of algorithms and tools for machine learning that are both easy to use and effective. The package supports a range of tasks, including classification, regression, clustering, dimensionality reduction, model selection and preprocessing steps such as normalization.




In [None]:
air_pollution_data_2023_complete_dataset["date"] = pd.to_datetime(air_pollution_data_2023_complete_dataset["date"], format="%d/%m/%Y %H:%M")
air_pollution_data_2023_complete_dataset["Hour"] = air_pollution_data_2023_complete_dataset["date"].dt.hour

In [None]:
display(air_pollution_data_2023_complete_dataset.head())

In [None]:
### Linear Regression

from sklearn.linear_model import LinearRegression
import numpy as np

# Select the feature (wind speed) and the target (NO2) from the dataset
X = air_pollution_data_2023_complete_dataset[["Wind Speed"]]
y = air_pollution_data_2023_complete_dataset[["NO2"]]

# Train a linear regression model
model = LinearRegression()
model.fit(X, y)

# Plotting the data points
plt.scatter(X, y, color='blue', label='Observations', alpha=0.05)

# Predicting the values to draw the regression line
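# We use the minimum and maximum wind speed to cover the whole range of the data.
# Keeping the column name avoids a feature-name warning from scikit-learn when predicting.
X_new = pd.DataFrame({"Wind Speed": np.linspace(X["Wind Speed"].min(), X["Wind Speed"].max(), 100)})
y_predict = model.predict(X_new)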

# Plotting the regression line
plt.plot(X_new, y_predict, color='red', linewidth=2, label='Regression Line')

# Adding title and labels
plt.title('Relationship Between Wind Speed and NO$_2$ Levels')
plt.xlabel('Wind Speed (m/s)')
plt.ylabel('NO$_2$ Concentration')
plt.legend()

# Show the plot
plt.show()


# Extracting and printing the intercept and slope
intercept = model.intercept_
slope = model.coef_

print("Intercept of the regression line:", intercept[0])  # intercept_ is a one-element array here because y was passed as a 2D DataFrame
print("Slope of the regression line:", slope[0][0])  # coef_ has shape (n_targets, n_features), so [0][0] is the single slope
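
The indexing used above (`intercept[0]` and `slope[0][0]`) follows from the shape of `y`. The following is a minimal sketch on made-up data (the names `x_demo`, `y_1d` and `y_2d` are illustrative and not part of the course dataset) comparing the attribute shapes when the target is passed as a 1D array and as a 2D column.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up, noise-free data following y = 2x + 1 (an assumption for illustration)
x_demo = np.arange(10, dtype=float).reshape(-1, 1)
y_1d = 2.0 * x_demo[:, 0] + 1.0   # shape (10,)
y_2d = y_1d.reshape(-1, 1)        # shape (10, 1), like df[["NO2"]]

model_1d = LinearRegression().fit(x_demo, y_1d)
model_2d = LinearRegression().fit(x_demo, y_2d)

# With a 1D target the intercept is a scalar and coef_ has one entry per feature
print(np.shape(model_1d.intercept_), model_1d.coef_.shape)

# With a 2D target both attributes gain a leading "target" dimension
print(model_2d.intercept_.shape, model_2d.coef_.shape)
```

Passing the target with single brackets, for example `df["NO2"]`, would therefore let you read the slope as `model.coef_[0]` and the intercept directly as `model.intercept_`.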




In [None]:
from jupyterquiz import display_quiz
display_quiz("../questions/scikitlearn_question_linear_regression_question.json")

In [None]:
from sklearn.cluster import KMeans
import numpy as np

# Set the number of clusters
k = 3  # Example number of clusters

# Create KMeans model
kmeans = KMeans(n_clusters=k, random_state=0)

# Fit the model
clusters = kmeans.fit_predict(air_pollution_data_2023_complete_dataset[['NO2', 'Temperature']])



In [None]:
# Adding the cluster information to the DataFrame
air_pollution_data_2023_complete_dataset['cluster'] = clusters

# Plotting clusters
plt.figure(figsize=(10, 6))
# Plot the same two features that were used to fit the clusters (NO2 and Temperature)
scatter = plt.scatter(
    air_pollution_data_2023_complete_dataset['NO2'],
    air_pollution_data_2023_complete_dataset['Temperature'],
    c=air_pollution_data_2023_complete_dataset['cluster'],
    cmap='viridis',
    alpha=0.15,
)
plt.title('Clusters of Air Pollution Data')
plt.xlabel('NO2 Concentration')
plt.ylabel('Temperature')
plt.colorbar(scatter)
plt.show()
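
K-means relies on distances, so features measured on very different scales can dominate the clustering. The following is a minimal sketch, on synthetic data rather than the course dataset, of one possible way to standardise features before clustering. The feature names and values are made up for illustration, and `n_init=10` is set explicitly as an assumption about a reasonable number of restarts.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two hypothetical features on very different scales, forming three groups
feature_a = np.concatenate([rng.normal(10, 2, 100), rng.normal(40, 2, 100), rng.normal(70, 2, 100)])
feature_b = np.concatenate([rng.normal(0.1, 0.02, 100), rng.normal(0.5, 0.02, 100), rng.normal(0.9, 0.02, 100)])
X_demo = np.column_stack([feature_a, feature_b])

# Scale each feature to zero mean and unit variance, then cluster
pipeline = make_pipeline(StandardScaler(), KMeans(n_clusters=3, n_init=10, random_state=0))
labels = pipeline.fit_predict(X_demo)

# Cluster centres are reported in the scaled feature space
centres = pipeline.named_steps["kmeans"].cluster_centers_
print(np.bincount(labels))
print(centres.shape)
```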


In [None]:
# Decision Tree Models 


from sklearn.tree import DecisionTreeRegressor

# Initialize the model
decision_tree_regressor = DecisionTreeRegressor(random_state=42, max_depth=2)

# Train the model on the entire dataset
decision_tree_regressor.fit(air_pollution_data_2023_complete_dataset[["Wind Speed"]], air_pollution_data_2023_complete_dataset["NO2"])


# matplotlib.pyplot was already imported as plt at the top of the notebook
from sklearn.tree import plot_tree

# Plot the decision tree
plt.figure(figsize=(50,10))
plot_tree(decision_tree_regressor, feature_names=['Wind Speed'], filled=True)
plt.show()
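
As a text alternative to the plot, the fitted splits can also be printed as rules. The following is a minimal sketch using `sklearn.tree.export_text` on synthetic data. The variables `wind_speed_demo` and `no2_demo` are made-up stand-ins, not the course dataset, and the relationship between them is an assumption chosen only for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(42)

# Synthetic stand-in data: the target falls as the feature rises, plus noise
wind_speed_demo = rng.uniform(0, 10, size=(200, 1))
no2_demo = 40 - 3 * wind_speed_demo[:, 0] + rng.normal(0, 2, size=200)

# Same settings as the tree above: a shallow tree with at most two levels of splits
tree_demo = DecisionTreeRegressor(random_state=42, max_depth=2)
tree_demo.fit(wind_speed_demo, no2_demo)

# Print the learned thresholds and leaf values as indented text
rules = export_text(tree_demo, feature_names=["Wind Speed"])
print(rules)
```

Each leaf value is the mean target of the training rows that reach it, which is why a shallow regression tree predicts a small number of distinct values.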


In [None]:
display_quiz("../questions/sciktilearn_question_decision_tree.json")

## Decomposition

Decomposition refers to the process of breaking down a complex problem into smaller, simpler sub-problems. The idea is to solve each sub-problem independently and then combine the solutions to obtain a solution to the original problem. This allows for the creation of more efficient and manageable algorithms by reducing the complexity of the problem and making it easier to identify and solve individual parts. Equally, this makes your code more reusable, as each function solves a smaller issue that may be reusable elsewhere.

When breaking the problem down, it can be helpful to first identify the:
* starting point
    * what input do you have?
    * what are the initial conditions?
* end point
    * what output do you ultimately want?
    * what is the end goal?

You can then start to fill in the gaps in between. While building your algorithm, you may find that your start point isn't actually the start point and that you need to go further back in the process. Similarly, you might find that your end goal needs to be redefined.

For example, in our tube navigation example, we are starting at Paddington TUBE station, but our passenger is at Paddington RAIL station. Our algorithm is meaningless if they can't get to the TUBE station, so we need to add a step to navigate from the RAIL station to the TUBE station. It is important to remember that your computer knows absolutely nothing at the beginning of a new program: you need to tell it everything.

![toEdgware](../individual_modules/computational_thinking/images/directions.jpg)

### Activity: Caesar cipher

In cryptography, a Caesar cipher is a very simple encryption technique in which each letter in the plain text is replaced by a letter some fixed number of positions down the alphabet. For example, with a shift of 3, A would be replaced by D, B would become E, and so on. The method is named after Julius Caesar, who used it to communicate with his generals. ROT-13 ("rotate by 13 places") is a widely used example of a Caesar cipher where the shift is 13. 

Your task in this exercise is to design the algorithm that a computer would need to follow to encode/decode a message using ROT-13. How does the idea of decomposition apply here? What would you do to implement this?

It might be helpful to think through how you would manually decode the following message:

```
Pnrfne pvcure? V zhpu cersre Pnrfne fnynq!
```



Credits: [Torbjorn Lager](https://www.gu.se/en/about/find-staff/torbjornlager)

Note: Because there are 26 letters (2×13) in the basic Latin alphabet, ROT-13 is its own inverse; that is, to undo ROT-13, the same algorithm is applied, so the same action can be used for encoding and decoding.
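
Once you have designed your own algorithm, you may want to compare it with the following minimal sketch of one possible decomposition. It is only one of many reasonable designs. The function names `shift_letter`, `caesar_shift` and `rot13` are hypothetical and chosen for illustration, and the sketch assumes that only the basic Latin letters are shifted while everything else is left unchanged.

```python
import string

ALPHABET_SIZE = len(string.ascii_lowercase)  # 26 letters in the basic Latin alphabet


def shift_letter(letter, shift):
    """Sub-problem 1: shift a single character, leaving non-letters unchanged."""
    if letter.isascii() and letter.isalpha():
        base = ord("A") if letter.isupper() else ord("a")
        # Wrap around the end of the alphabet with the modulo operator
        return chr((ord(letter) - base + shift) % ALPHABET_SIZE + base)
    return letter


def caesar_shift(message, shift):
    """Sub-problem 2: apply the single-letter shift to every character."""
    return "".join(shift_letter(character, shift) for character in message)


def rot13(message):
    """ROT-13 is a Caesar cipher with a fixed shift of 13."""
    return caesar_shift(message, 13)


print(caesar_shift("AB", 3))      # DE
encoded = rot13("Hello, World!")
print(encoded)                    # Uryyb, Jbeyq!
print(rot13(encoded))             # Hello, World!
```

Applying `rot13` twice returns the original text, which is the self-inverse property described in the note above.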

