_This part of the code was executed in a MySQL 8.0 environment._

# Step 1: Extracting relevant data for the prediction model.
We'll consider film features (e.g., title, release year, rating, length)
and rental features (e.g., rental rate, number of times rented, last rental date).

# Step 2: Creating Target Variable
The target variable will indicate whether a film was rented last month.
This requires knowing the most recent rental date for each film and checking if it falls within the last month.
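
As a minimal sketch of this rule, the toy example below flags films whose most recent rental falls within 30 days of the newest rental in the data. The values are hypothetical (not taken from Sakila), and the 30-day window is an assumption about what "last month" means here.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from Sakila)
last_rentals = pd.DataFrame({
    "film_id": [1, 2, 3],
    "last_rental_date": pd.to_datetime(["2006-02-14", "2005-08-23", None]),
})

# Treat the newest rental in the data as "today" (max() skips missing dates)
current_date = last_rentals["last_rental_date"].max()
cutoff = current_date - pd.Timedelta(days=30)

# A missing date (NaT) compares as False, so never-rented films get 0
# without needing a placeholder date
last_rentals["rented_last_month"] = (
    last_rentals["last_rental_date"] >= cutoff
).astype(int)

print(last_rentals)
```

Film 3 has never been rented, so its date is missing and it falls into the "not rented" class.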

# Step 3: Data Preprocessing

In [3]:
import pandas as pd
from datetime import datetime, timedelta



# Loading necessary CSV files into Pandas dataframes
films_df = pd.read_csv(r'C:\Users\wailb\Documents\Sakila Databse CSV\film.csv')
inventory_df = pd.read_csv(r'C:\Users\wailb\Documents\Sakila Databse CSV\inventory.csv')
rental_df = pd.read_csv(r'C:\Users\wailb\Documents\Sakila Databse CSV\rental.csv')


# Converting rental_date to datetime
rental_df['rental_date'] = pd.to_datetime(rental_df['rental_date'])

# Merging to simulate the join operation in SQL
merged_df = films_df.merge(inventory_df, on='film_id', how='left').merge(rental_df, on='inventory_id', how='left')

# Calculating features: number of rentals and last rental date for each film
film_features_df = merged_df.groupby('film_id').agg(
    title=('title', 'first'),
    release_year=('release_year', 'first'),
    rating=('rating', 'first'),
    length=('length', 'first'),
    rental_rate=('rental_rate', 'first'),
    number_of_rentals=('rental_id', 'count'),
    last_rental_date=('rental_date', 'max')
).reset_index()

# Displaying the first few rows of the dataframe
film_features_df.head()


|   | film_id | title | release_year | rating | length | rental_rate | number_of_rentals | last_rental_date |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | ACADEMY DINOSAUR | 2006 | PG | 86 | 0.99 | 2 | 2005-05-30 20:21:07 |
| 1 | 2 | ACE GOLDFINGER | 2006 | G | 48 | 4.99 | 0 | NaT |
| 2 | 3 | ADAPTATION HOLES | 2006 | NC-17 | 50 | 2.99 | 0 | NaT |
| 3 | 4 | AFFAIR PREJUDICE | 2006 | G | 117 | 2.99 | 2 | 2005-05-31 00:06:02 |
| 4 | 5 | AFRICAN EGG | 2006 | G | 130 | 2.99 | 1 | 2005-05-28 07:53:38 |


# Step 4: Picking the Target Variable
### Creating the Predictive Model (Logistic Regression)

For the target variable, we determine whether each film was rented in the last month, defined here as the 30 days before the latest rental date in the data.
This step compares each film's `last_rental_date` against the start of that 30-day window.

In [4]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Setting the current date to the latest date in the rental data for simulation
current_date = rental_df['rental_date'].max()

# Defining a date far in the past for films never rented
far_past_date = datetime(1900, 1, 1)

# Filling missing last_rental_date with a date far in the past
# (assigning the result back avoids the chained-assignment warning raised by inplace=True)
film_features_df['last_rental_date'] = film_features_df['last_rental_date'].fillna(far_past_date)

# Creating the target variable: 1 if last rented within the 30 days before current_date, 0 otherwise
film_features_df['rented_last_month'] = (film_features_df['last_rental_date'] >= (current_date - timedelta(days=30))).astype(int)

# Preparing for preprocessing
features_to_encode = ['rating']
features_to_scale = ['length', 'rental_rate', 'number_of_rentals']

# Preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), features_to_scale),
        ('cat', OneHotEncoder(), features_to_encode)
    ])

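# Fitting the preprocessor and transforming the data
# (note: this fits on all rows, so the scaler also sees rows that later land in the test set)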
X = film_features_df[features_to_scale + features_to_encode]
y = film_features_df['rented_last_month']
X_transformed = preprocessor.fit_transform(X)

# Checking the shape of the transformed features and the target variable
X_transformed.shape, y.shape


((1000, 8), (1000,))

# Step 5: Training and Evaluating the Model

In [5]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_transformed, y, test_size=0.2, random_state=42)

# Initialize and train the logistic regression model
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

# Predict on the test set
y_pred = logreg.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy}")
print(report)

Accuracy: 1.0
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       173
           1       1.00      1.00      1.00        27

    accuracy                           1.00       200
   macro avg       1.00      1.00      1.00       200
weighted avg       1.00      1.00      1.00       200
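
The cells above fit the preprocessor on every row before splitting. One possible alternative, sketched below, wraps preprocessing and the classifier in a single `Pipeline` so that scaling and encoding are learned from the training rows only. `demo_df` is a synthetic stand-in with random values and is not the real `film_features_df`; the `stratify` and `handle_unknown="ignore"` settings are suggestions rather than part of the original workflow.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for film_features_df (random values, illustration only)
rng = np.random.default_rng(0)
n = 200
demo_df = pd.DataFrame({
    "length": rng.integers(46, 186, size=n),
    "rental_rate": rng.choice([0.99, 2.99, 4.99], size=n),
    "number_of_rentals": rng.integers(0, 35, size=n),
    "rating": rng.choice(["G", "PG", "PG-13", "R", "NC-17"], size=n),
    "rented_last_month": rng.integers(0, 2, size=n),
})

numeric = ["length", "rental_rate", "number_of_rentals"]
categorical = ["rating"]

# Split first, keeping the class balance similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    demo_df[numeric + categorical],
    demo_df["rented_last_month"],
    test_size=0.2,
    random_state=42,
    stratify=demo_df["rented_last_month"],
)

# Preprocessing lives inside the pipeline, so fit() only sees training rows
model = Pipeline([
    ("prep", ColumnTransformer([
        ("num", StandardScaler(), numeric),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ])),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
print(f"Test accuracy on synthetic data: {test_accuracy:.2f}")
```

Because the labels here are random, the accuracy itself carries no meaning; the point is only the order of operations (split, then fit).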

In [6]:
## Accuracy: 100%
## Precision, Recall, and F1-Score: 100% for both classes (rented last month vs. not rented)

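A perfect score on the test set is a warning sign rather than a success: in a real-world scenario, 100% accuracy is highly unlikely, so the result more plausibly reflects an overly simple setup than a genuinely strong model.

Treating the latest date in the rental data as "current" may have contributed to this outcome. One plausible mechanism is target leakage: `number_of_rentals` is counted over the whole rental history, including the final 30 days that define `rented_last_month`, so the feature can partly encode the target.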
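
One way to test the leakage hypothesis, sketched here under stated assumptions, is to build features only from rentals before the 30-day cutoff and the target only from rentals inside the window. The rental log and film list below are hypothetical toy rows, not Sakila data, and the variable names are illustrative.

```python
import pandas as pd

# Toy rental log (hypothetical rows, not taken from Sakila)
rentals = pd.DataFrame({
    "film_id": [1, 1, 2, 2, 3],
    "rental_date": pd.to_datetime([
        "2005-06-01", "2005-08-20", "2005-06-15", "2005-07-01", "2005-08-10",
    ]),
})
all_films = pd.DataFrame({"film_id": [1, 2, 3, 4]})

# "Today" is the newest rental; the target window is the 30 days before it
current_date = rentals["rental_date"].max()
cutoff = current_date - pd.Timedelta(days=30)

# Features may only use rentals that happened before the cutoff
history = rentals[rentals["rental_date"] < cutoff]
# The target may only use rentals inside the window
target_window = rentals[rentals["rental_date"] >= cutoff]

features = (
    history.groupby("film_id")
    .agg(number_of_rentals=("rental_date", "count"))
    .reset_index()
)

# Left-join so films without any history are kept with a count of 0
dataset = all_films.merge(features, on="film_id", how="left")
dataset["number_of_rentals"] = dataset["number_of_rentals"].fillna(0).astype(int)
dataset["rented_last_month"] = (
    dataset["film_id"].isin(target_window["film_id"]).astype(int)
)

print(dataset)
```

With this split, a film's rental count no longer includes the rentals that decide its label, so a perfect score would be harder to explain by leakage alone.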


# Final Step: Using the Model

In [9]:
# Let's select a random sample from our original dataset to predict on
sample_film_data = film_features_df.sample(n=1, random_state=42)

# Extracting the relevant features for prediction (ensure the order matches the model's training data)
sample_features = sample_film_data[features_to_scale + features_to_encode]

# Preprocessing the sample film data
sample_features_transformed = preprocessor.transform(sample_features)

# Predicting the class label for the sample film (1 = rented, 0 = not rented)
predicted_rental_sample = logreg.predict(sample_features_transformed)

# Preparing the prediction result for display
# (the model was trained on rented_last_month, so "next month" assumes recent behaviour carries over)
prediction_result = "likely to be rented next month" if predicted_rental_sample[0] == 1 else "unlikely to be rented next month"

# Displaying the film's title and the prediction result
sample_film_data['title'].values[0], prediction_result


('LIFE TWISTED', 'unlikely to be rented next month')
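
The cell above returns a hard class label. If a graded likelihood is wanted instead, scikit-learn's `predict_proba` returns class probabilities. The sketch below uses a small synthetic dataset and a separately fitted `clf`, both hypothetical, since the notebook's fitted `logreg` and data are not reproduced here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic features and labels (illustration only, not Sakila data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=100) > 0).astype(int)

clf = LogisticRegression().fit(X_demo, y_demo)

# Column 1 of predict_proba holds the probability of class 1
proba = clf.predict_proba(X_demo[:1])[0, 1]
label = clf.predict(X_demo[:1])[0]

print(f"Probability of class 1: {proba:.2f}, predicted label: {label}")
```

Reporting the probability alongside the label makes it easier to see how close a film is to the decision boundary.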