Practicing the Art of Building a very Basic Machine Learning Model in Python - my first ever!

In [1]:
# Step 1: Import necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import numpy as np

# Step 2: Create some sample data (hours studied vs. pass/fail)
hours_studied = np.array([2, 3, 4, 5, 6, 7, 8, 9, 10])
pass_fail = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1])  # 0 indicates fail, 1 indicates pass

# Step 3: Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(hours_studied.reshape(-1, 1), pass_fail, test_size=0.2, random_state=42)

## hours_studied.reshape(-1, 1): This turns the flat array of 9 values into a 2-D array with 9 rows and 1 column, the (samples, features) layout that scikit-learn models expect. Think of it as organizing the ingredients in a way the recipe (model) prefers.
## pass_fail: These are the labels indicating whether a student passed or failed.
## test_size=0.2: It means 20% of the data will be set aside for testing, while 80% will be used for training. With only 9 samples, scikit-learn rounds the test set up to 2 samples, leaving 7 for training.
## random_state=42: This ensures that every time you run the code, the data splitting remains consistent. It's like using the same cookbook edition for a particular recipe.

# Step 4: Create a Logistic Regression model
model = LogisticRegression()

# Step 5: Train the model on the training data
model.fit(X_train, y_train)

# Step 6: Make predictions on the test data
predictions = model.predict(X_test)

## ^Uses the reserved ingredients in the test set (X_test) to see how well the Model performs. 

# Step 7: Evaluate the model's accuracy
accuracy = accuracy_score(y_test, predictions)

print(f"Model Accuracy: {accuracy * 100:.2f}%")

# Step 8: Display the results
print("Actual outcomes:", y_test)
print("Predicted outcomes:", predictions)

Model Accuracy: 100.00%
Actual outcomes: [1 0]
Predicted outcomes: [1 0]
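
To make the reshape and split steps above a bit more concrete, here is a small sketch that prints the array shapes at each stage and checks that random_state=42 reproduces the same split. It reuses the same toy data; the names X, shapes, and same_split exist only for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

hours_studied = np.array([2, 3, 4, 5, 6, 7, 8, 9, 10])
pass_fail = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1])

# reshape(-1, 1) turns the flat array into 9 rows x 1 column
X = hours_studied.reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, pass_fail, test_size=0.2, random_state=42
)
shapes = (hours_studied.shape, X.shape, X_train.shape, X_test.shape)
print(shapes)  # ((9,), (9, 1), (7, 1), (2, 1))

# Splitting again with the same random_state picks the same test rows
_, X_test_again, _, _ = train_test_split(
    X, pass_fail, test_size=0.2, random_state=42
)
same_split = np.array_equal(X_test, X_test_again)
print(same_split)  # True
```

With a test set of only 2 samples, the accuracy can only be 0%, 50%, or 100%, so the 100% above says little about how the model would do on more data.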


# Simple Learning & Predicting Using Machine-Learning

In [15]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier # package = "sklearn" (the import name of the scikit-learn library), module = "tree", class = "DecisionTreeClassifier"

df = pd.read_csv(r"/Users/alijazibrizvi/Downloads/music.csv") # Gender of "0": Female; Gender of "1": Male
df

# Goal: to Predict Preference of Genre of a person based on Age & Gender

### Steps:

# 1) Split this Dataset into Two: One with the "Age" & "Gender" columns, the Other with the "Genre" column
##^ Whenever we Train a Model, we give it 2 Datasets: an "Input" set (in this case, the "Age" & "Gender" columns) and an "Output" set (contains the known answers the Model should learn to Predict - in this case, the "Genre" column)
###^ Once the Model is Trained, we give it a New Input set(s) to Predict from
###^ For example, after the initial training, give the Model an Input set of [21, 1] to Predict Favorite Genre of (Combination of "Age" & "Gender" not in Original Dataset but which the Model will Predict based on its earlier Training)

# Let's Split the Dataset
X = df.drop(columns = ["genre"]) # Removing the "Genre" column to Create the "Input" set
y = df["genre"] # Showing Only the "Genre" column to define it as the "Output" set

model = DecisionTreeClassifier() # Creating a New Instance of the "DecisionTreeClassifier" Class

## We Have the Model^. Now, we Need to Train it to Learn Patterns in the Data
model.fit(X, y) # Training the Model: This is How it will Learn the Patterns in the Data - model.fit(*input set*, *output set*)
predictions = model.predict([ [21, 1], [22, 0] ]) # Predict Favorite Genre for Each of the Two Input sets: a 21-year-old Male ("[21, 1]") and a 22-year-old Female ("[22, 0]")
predictions # There ya go! "HipHop" & "Dance" are the Respective (and quite Accurate, from the Data) Predictions!





array(['HipHop', 'Dance'], dtype=object)
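
Since music.csv is not included here, the following is a minimal sketch of the same fit-then-predict flow on a tiny made-up table. The rows and the lowercase column names "age" and "gender" are assumptions for illustration, not the real file contents. It also passes the new inputs as a DataFrame with the same column names as the training data, which keeps the feature names consistent between training and prediction.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in for music.csv: these rows are made up for illustration
df = pd.DataFrame({
    "age":    [20, 23, 26, 29, 31, 35, 20, 22, 26, 28, 31, 34],
    "gender": [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "genre":  ["HipHop", "HipHop", "Jazz", "Jazz", "Classical", "Classical",
               "Dance", "Dance", "Acoustic", "Acoustic", "Classical", "Classical"],
})

X = df.drop(columns=["genre"])  # input set
y = df["genre"]                 # output set

model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)

# New inputs use the same column names the model was trained on
new_people = pd.DataFrame({"age": [21, 22], "gender": [1, 0]})
predictions = model.predict(new_people)
print(predictions)
```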

# Measuring the Accuracy of a Model

In [76]:
# To Measure the Accuracy of your Model, you must Split your Dataset into Two: One for Training & the Other for Testing

# The More Complex the Problem is, the More Data we Need to Train the Model for Accuracy

## Generally: Allocate 70-80% of your Data for Training & the Remaining 20-30% for Testing
###^ Usually, the Higher the Test Size, the Less Data is Left for Training, so the Model's Accuracy tends to be Lower
###^ The Higher the Quantity and Cleanliness of Data we Give to the Model for Training, the Higher the Model's Accuracy should be
###^ Thus, it is Very Important to Clean our Data before Training & Testing our Model

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X = df.drop(columns = ["genre"])
y = df["genre"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size =0.20) # We are Allocating 20% (test_size = 0.2) of our Data for Testing
##^ This Function Returns a List of Four Items; so, we will Unpack it into Four Variables ("X_train, X_test, ...")
###^ The first 2 variables ("X_train", "X_test") are the Input Sets for Training & Testing
###^ The last 2 variables ("y_train", "y_test") are the Output Sets for Training & Testing

model = DecisionTreeClassifier()
model.fit(X_train, y_train) # Here, Pass only the Training Dataset
predictions = model.predict(X_test) # Pass the Dataset ("Input set") that contains Input Values for Testing

# Calculating Accuracy of Model by Comparing the result of "predictions" to Actual Values of the Output Set for Testing ("y_test")
                             
score = accuracy_score(y_test, predictions) # The "score" will be Between 0 and 1
score # e.g., "1.0" = 100% Accurate!
## ^i.e., a score of 1.0 means Every Prediction on the Test set matched the Actual Genre; it does Not Guarantee the Correct Genre for Every New Input, such as a 21-year-old Male [21, 1]

# By the way, Each Time you Run this^ Accuracy Predictor, it might result in a Different Accuracy Value since the Dataset-Splitting Function Randomly Picks Data for Training & Testing

0.75
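
The run-to-run variation mentioned above can be seen directly by repeating the split with different seeds. This is a rough sketch on a made-up table (the rows and the column names "age" and "gender" are assumptions, not the contents of music.csv); score_for_seed is a helper defined only for this example.

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data, invented for illustration
df = pd.DataFrame({
    "age":    [20, 23, 26, 29, 31, 35, 20, 22, 26, 28, 31, 34],
    "gender": [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "genre":  ["HipHop", "HipHop", "Jazz", "Jazz", "Classical", "Classical",
               "Dance", "Dance", "Acoustic", "Acoustic", "Classical", "Classical"],
})
X = df.drop(columns=["genre"])
y = df["genre"]

def score_for_seed(seed):
    # A fixed random_state makes the split repeatable for that seed
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))

# Different seeds choose different test rows, so the scores can differ
scores = [score_for_seed(seed) for seed in range(5)]
print(scores)

# The same seed always gives the same score
repeatable = score_for_seed(0) == score_for_seed(0)
print(repeatable)  # True
```

Averaging over several splits like this gives a steadier picture than any single run.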

# Same ML Model^ but with the Addition of Updating the Model with New Data:

In [None]:
# Import necessary libraries
from sklearn.linear_model import LogisticRegression
from IPython.display import display, clear_output
import numpy as np
import time

# Create initial data

hours_studied = np.array([2, 3, 4, 5, 6, 7, 8, 9, 10])
pass_fail = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1])

# Create and train the initial model

model = LogisticRegression()
model.fit(hours_studied.reshape(-1, 1), pass_fail)

# Function to update and visualize predictions

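def update_model_and_predict(new_data):
    global hours_studied, pass_fail
    # Update the model with new data: fit() retrains from scratch and needs both classes,
    # so add the new rows to the existing data and refit on all of it
    hours_studied = np.append(hours_studied, new_data[:, 0])
    pass_fail = np.append(pass_fail, new_data[:, 1])
    model.fit(hours_studied.reshape(-1, 1), pass_fail)
    
    # Make predictions on the updated model
    predictions = model.predict(hours_studied.reshape(-1, 1))
    
    # Visualize predictions
    clear_output(wait=True)
    display(predictions)
    time.sleep(2)  # Simulating real-time updates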

# Simulate live data updates
new_data1 = np.array([[11, 1], [12, 1]])
update_model_and_predict(new_data1)

new_data2 = np.array([[3, 0], [4, 0]])
update_model_and_predict(new_data2)

## This example defines a function update_model_and_predict that takes in new data, updates the model, and visualizes the predictions. 
## The clear_output and time.sleep functions simulate the idea of real-time updates in a Jupyter Notebook environment.
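
If the goal is to update a model batch by batch without refitting on everything, scikit-learn offers partial_fit on some estimators. The sketch below swaps in SGDClassifier, which is a different linear model from the LogisticRegression used above, purely because it supports partial_fit. It is an illustration of the mechanism under that assumption, not a drop-in replacement.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

hours_studied = np.array([2, 3, 4, 5, 6, 7, 8, 9, 10])
pass_fail = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1])

# SGDClassifier is used here only because it supports partial_fit
model = SGDClassifier(random_state=0)

# The first partial_fit call must list every class the model will ever see
model.partial_fit(hours_studied.reshape(-1, 1), pass_fail, classes=np.array([0, 1]))

# Later batches may contain a single class; earlier learning is kept
new_data1 = np.array([[11, 1], [12, 1]])
model.partial_fit(new_data1[:, 0].reshape(-1, 1), new_data1[:, 1])

new_data2 = np.array([[3, 0], [4, 0]])
model.partial_fit(new_data2[:, 0].reshape(-1, 1), new_data2[:, 1])

predictions = model.predict(hours_studied.reshape(-1, 1))
print(predictions)
```

With so few samples and a single pass per batch, the predictions may be rough; the point is only that each call adjusts the existing model instead of starting over.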

In [None]:
# For Changing "Categorical" (Written in Letters) Data into Numeric Data for Regression Models, etc.

## Encoding Categorical data

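from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Assumes X is a 2-D NumPy array (not the DataFrame used above) whose column at index 3 holds text categories
labelencoder = LabelEncoder()
X[:, 3] = labelencoder.fit_transform(X[:, 3]) # "X[:, 3]": Take all rows (":") of the X dataset but Edit only the column at index 3 (the 4th column)
# Then, we set it equal to a Transformation ("labelencoder.fit_transform") of that same column ("[:, 3]"), Meaning that - for example - instead of 'New York' it will hold an integer code such as 0, 1, or 2

# "OneHotEncoder(categorical_features = [3])" was deprecated in scikit-learn 0.20 and later removed; ColumnTransformer now selects the column to encode
columntransformer = ColumnTransformer([("onehot", OneHotEncoder(), [3])], remainder = "passthrough", sparse_threshold = 0)
X = columntransformer.fit_transform(X) # This Final Transformation replaces the category column with 0/1 indicator columns (placed first) so that the Data holds Numbers Only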
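
To show what one-hot encoding actually produces, here is a minimal sketch on a made-up single-column array of state names (the values are hypothetical). Each category becomes its own 0/1 column, ordered alphabetically.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical category column; scikit-learn expects a 2-D input
states = np.array([["New York"], ["California"], ["Florida"], ["New York"]])

encoder = OneHotEncoder()
encoded = encoder.fit_transform(states).toarray()  # sparse matrix -> dense array

categories = encoder.categories_[0].tolist()
print(categories)  # ['California', 'Florida', 'New York']
print(encoded)
# [[0. 0. 1.]
#  [1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]
```

Unlike integer codes, the indicator columns do not suggest that one category is "greater" than another, which matters for regression-style models.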

## Ways of Connecting to a Data Source AND/OR Continuously Updating Data:

### Connecting to a Data Source:

#### Databases (e.g., SQLite, MySQL, PostgreSQL):
Use a database connector like sqlite3, mysql-connector, or psycopg2 for the respective database.

In [None]:
import sqlite3
import pandas as pd

# Connect to SQLite database
conn = sqlite3.connect('your_database.db')

# Use pandas to read data from a SQL query
data = pd.read_sql_query('SELECT * FROM your_table', conn)

# Close the connection once the data is loaded
conn.close()
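
For a version that runs without an existing database file, the sketch below builds a throwaway in-memory SQLite database first. The table name study_log and its columns are hypothetical, chosen to match the hours-studied example from earlier.

```python
import sqlite3
import pandas as pd

# In-memory database so the example runs without a file on disk
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE study_log (hours_studied INTEGER, passed INTEGER)")
conn.executemany("INSERT INTO study_log VALUES (?, ?)", [(2, 0), (6, 1), (9, 1)])
conn.commit()

# The "?" placeholder keeps the filter value out of the SQL string itself
data = pd.read_sql_query(
    "SELECT * FROM study_log WHERE hours_studied >= ?", conn, params=(5,)
)
conn.close()
print(data)
```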

#### APIs:
Use the requests library to fetch data from APIs.

In [None]:
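import requests

# Make a GET request to an API endpoint (the timeout stops the call from hanging indefinitely)
response = requests.get('https://api.example.com/data', timeout=10)
response.raise_for_status()  # Raise an error for 4xx/5xx responses instead of parsing an error page

# Parse JSON response
data = response.json()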
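
Once the JSON is parsed, it usually needs to become a table before a model can use it. Here is a small sketch of that step; the payload is a hard-coded stand-in for response.json(), and its "results" key and field names are assumptions, since every API shapes its responses differently.

```python
import pandas as pd

# Stand-in for response.json(); the structure and field names are hypothetical
payload = {
    "results": [
        {"hours_studied": 2, "passed": 0},
        {"hours_studied": 8, "passed": 1},
    ]
}

# A list of dictionaries maps directly onto DataFrame rows
data = pd.DataFrame(payload["results"])

X = data[["hours_studied"]]  # input set (2-D)
y = data["passed"]           # output set
print(data.shape)  # (2, 2)
```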

### Continuously Updating Data:

#### Periodic Fetching:
Schedule periodic fetches using a library like schedule to update your data at specified intervals.

In [None]:
import schedule
import time

def update_data():
    # Your data update logic here
    pass  # Placeholder so the function body is valid Python

# Schedule updates every hour
schedule.every().hour.do(update_data)

while True:
    schedule.run_pending()
    time.sleep(1)
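
The same periodic idea can be sketched with only the standard library, which also makes it easy to stop after a fixed number of runs while testing. This is a simplified stand-in for the schedule loop above, not how the schedule library works internally; run_periodically and fetch_count are names invented for this example.

```python
import time

fetch_count = 0

def update_data():
    # Placeholder for the real fetch; here it only counts the calls
    global fetch_count
    fetch_count += 1

def run_periodically(job, interval_seconds, max_runs):
    # Call job, wait interval_seconds, and repeat until max_runs is reached
    for _ in range(max_runs):
        job()
        time.sleep(interval_seconds)

# A tiny interval keeps the demo fast; an hourly job would use 3600
run_periodically(update_data, interval_seconds=0.01, max_runs=3)
print(fetch_count)  # 3
```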

#### Streaming Data:
For streaming data, consider using a library like websocket-client to receive WebSocket-based updates, with pandas to organize the incoming records.

In [None]:
import pandas as pd
import websocket  # Provided by the "websocket-client" package

def update_data(message):
    # Your data update logic for one message here
    pass

def on_message(ws, message):
    # Your data processing logic for each streaming message
    update_data(message)

# Connect to a WebSocket for streaming updates
ws = websocket.WebSocketApp('wss://stream.example.com', on_message=on_message)
ws.run_forever()
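
What the handler does with each message depends on the stream. As one possible sketch, assuming every message is a JSON object with hypothetical fields "hours_studied" and "passed", the handler can collect parsed messages in a list and convert them to a DataFrame when needed. The two calls at the bottom simulate incoming messages so the example runs without a real connection.

```python
import json
import pandas as pd

rows = []  # buffer of parsed messages

def on_message(ws, message):
    # Each message is assumed to be a JSON object,
    # e.g. {"hours_studied": 5, "passed": 0}
    rows.append(json.loads(message))

# Simulate two incoming messages instead of opening a real WebSocket
on_message(None, '{"hours_studied": 5, "passed": 0}')
on_message(None, '{"hours_studied": 9, "passed": 1}')

# Build the table from the buffer whenever the model needs fresh data
data = pd.DataFrame(rows)
print(len(data))  # 2
```

Appending to a list and building the DataFrame in one go is generally cheaper than growing a DataFrame one row at a time.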