# **<font color='#32a852'>Assignment 3: Prediction </font>**


### **<font color='#6b32a8'>Model accuracy on validation data: 93.27% </font>**

## Archisa Bhattacharya

**PROJECT LINKS:**

Kaggle Dataset: https://www.kaggle.com/datasets/lainguyn123/student-performance-factors

---

# **<font color='violet'>Libraries</font>**

In [1]:
from google.colab import files
import numpy as np
import pandas as pd
import io
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score


# **<font color='violet'>Part 1:</font> Loading the Data, EDA, & Data Cleaning**

## **<font color='green'>Loading the Data </font>**

In [17]:
# Mount Google Drive

from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [22]:
# Read in data path

training_data = pd.read_csv('/content/drive/My Drive/DSAssignment3/PCTrainUpdated.csv')
testing_data = pd.read_csv('/content/drive/My Drive/DSAssignment3/PCTestUpdated.csv')
conversion_rates = pd.read_csv('/content/drive/My Drive/DSAssignment3/currency_conversion_rates.csv')


Let's compare the different data sets we are considering and look for interesting patterns visually, using a heatmap of the correlation matrix. This lets us easily see which values appear to be correlated with each other. While correlation does not equal causation, it gives us a nice jumping-off point!
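
As a minimal sketch of what such a heatmap could look like, the snippet below builds a correlation matrix with pandas and draws it with matplotlib. The toy table and its values are made up for illustration (the column names are borrowed from features used later in this notebook), so treat it as a template rather than the actual analysis.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for the numeric columns; every value here is made up
toy = pd.DataFrame({
    "ratings": [4.1, 3.9, 4.5, 3.2, 4.8],
    "no_of_ratings": [120, 45, 300, 10, 510],
    "discount_percent": [10.0, 35.0, 5.0, 50.0, 2.0],
})

# Pairwise Pearson correlation between the numeric columns
corr = toy.corr()

# Draw the matrix as a heatmap; the colour scale is fixed to [-1, 1]
fig, ax = plt.subplots(figsize=(4, 4))
im = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)
fig.colorbar(im, ax=ax, label="Pearson r")
fig.tight_layout()
```

In a notebook the figure should render at the end of the cell; elsewhere, `plt.show()` or `fig.savefig(...)` would be needed.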

In [52]:
# Currency conversion helper
# Convert a price string of the form "<value> <currency code>" to USD using the conversion rates
def convToUSD(price, convRates):
    if isinstance(price, str):
        try:
            price_value, currency = price.split()  # split the price into its numeric value and its currency code
            price_value = float(price_value)
            convRate = convRates.get(currency, 1.0)  # if the currency code is not in the table, assume the price is already in USD
            return price_value / convRate
        except (ValueError, ZeroDivisionError):
            return None  # the string could not be parsed or converted
    return price  # non-string values (e.g. NaN or numeric prices) are returned unchanged

currDict = dict(zip(conversion_rates['Currency_Code'], conversion_rates['100_USD_worth']))

# Apply currency conversion
training_data['actual_price_usd'] = training_data['actual_price'].apply(lambda x: convToUSD(x, currDict))
training_data['discount_price_usd'] = training_data['discount_price'].apply(lambda x: convToUSD(x, currDict))

testing_data['actual_price_usd'] = testing_data['actual_price'].apply(lambda x: convToUSD(x, currDict))
testing_data['discount_price_usd'] = testing_data['discount_price'].apply(lambda x: convToUSD(x, currDict))
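
For larger tables, the same conversion can also be sketched in a vectorized way instead of a row-by-row `apply`. This is only an illustrative alternative, not the method used above: the sample prices and the `rates` dictionary are hypothetical (here a rate means units of currency per 1 USD), and unlike `convToUSD`, an unknown currency code yields NaN rather than being treated as USD.

```python
import pandas as pd

# Hypothetical sample prices and rates, made up for this sketch
prices = pd.Series(["1299 INR", "25 USD", None, "bad-value"])
rates = {"INR": 83.0, "USD": 1.0}  # assumed meaning: units of currency per 1 USD

# Pull the numeric value and the 3-letter currency code out of each string;
# rows that do not match the pattern become NaN in both columns
parts = prices.str.extract(r"^\s*(?P<value>[\d.]+)\s+(?P<currency>[A-Z]{3})\s*$")

value = pd.to_numeric(parts["value"], errors="coerce")
rate = parts["currency"].map(rates)  # unknown or missing codes map to NaN

usd = value / rate
print(usd)
```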


# CLEANING THE MISSING VALUES

There are still a lot of missing values in our training data frame. Dropping them would be much easier. However, if we simply drop them, we could lose a lot of precious data, and this could make our final prediction model much weaker. Let's fill them in with an average value instead. I chose the median rather than the mean because the median is barely affected by outliers. We will also compute the discount percent & add it as a new column in our training data.


This is called Feature Engineering in Machine Learning!! It is an invaluable way to manipulate the data that we have already collected to make it more useful to us. In this case, we use a simple percentage equation to calculate the discount percentage from the 2 features, actual price & discount price.
You may note that, starting from the original features, we have performed the following steps, in order:


1. Converting the price features we have to a usable currency (USD)
2. Systematically transforming the data with the equation that best suits our purpose (here, the discount percentage)
3. Finding the missing values in our data & filling them in with a specific value (such as the median)
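
The last two steps can be sketched on a toy table. This is a minimal illustration under my own assumptions, not the exact cleaning cell: the prices are made up, and `median_pct` and `mean_pct` are names invented here to show why the median is the safer fill value when one row is an outlier.

```python
import numpy as np
import pandas as pd

# Toy prices already converted to USD; all values are made up
toy = pd.DataFrame({
    "actual_price_usd":   [20.0, 50.0, np.nan, 10.0, 400.0, 100.0],
    "discount_price_usd": [15.0, 40.0, 8.0, np.nan, 300.0, 1.0],
})

# Step 2: feature engineering with a simple percentage equation.
# Rows with a missing price produce NaN here.
toy["discount_percent"] = (
    100 * (toy["actual_price_usd"] - toy["discount_price_usd"]) / toy["actual_price_usd"]
)

# The 99% discount in the last row is an outlier: it drags the mean up
# but leaves the median untouched
median_pct = toy["discount_percent"].median()
mean_pct = toy["discount_percent"].mean()
print(f"median: {median_pct}, mean: {mean_pct}")

# Step 3: fill the missing values with the median
toy["discount_percent"] = toy["discount_percent"].fillna(median_pct)
print(toy)
```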



I find that this process is very similar to the process of knitting, from start to finish.
1. First, you must make the yarn usable. When you first buy yarn, it is very difficult to keep the yarn skein in place. Many knitters have to re-wind the skein of yarn into a ball of yarn, so that they can pull the yarn from the center. This keeps the yarn from moving around & creating hassles while knitting.
2. Second, you must determine the best stitch to use to complete your piece. There are many different varieties and types of knitting. A simple stitch is called the 'knit' stitch, which involves manipulating 'data' (yarn) using two 'vectors' (knitting needles). Just like choosing the appropriate method of cleaning, choosing the right stitch is vital to finishing a piece most quickly and efficiently using the amount of yarn provided.
3. Once you have finished knitting, you may notice that you have made many human errors, such as dropping stitches, which can leave large holes that can cause the entire piece to unravel. Filling these holes with the right kind of stitch is vital to ensuring the quality of the final product.



# PICKING FEATURES FOR OUR MODEL

Now comes the hard part: We have to decide what features we want in the model. Luckily, we have a couple hints provided to us on what features to use.

Hint: discount% (should be straightforward), discount_amount (should be straightforward), weighted_rating (x + ay^2 where x and y are other columns, and a is a constant)

I cleaned the Ratings and # of Ratings columns to ensure that all empty or null values were filled in: missing ratings get the best measure of center (again, the median, to prevent overweighting any outliers), and missing rating counts are set to 0.

Then, much like knitting, I chose the appropriate equation & performed the creation & cleaning of these features, as advised.


In [66]:
training_data_clean = training_data_clean.copy()
testing_data_clean = testing_data_clean.copy()

######  RATINGS & # OF RATINGS ############
# Clean the ratings & number-of-ratings features, in training & testing
# (the test set is filled with the training median so no test information leaks in)
training_data_clean.loc[:, 'ratings'] = training_data_clean['ratings'].fillna(training_data_clean['ratings'].median())
training_data_clean.loc[:, 'no_of_ratings'] = training_data_clean['no_of_ratings'].fillna(0)
testing_data_clean.loc[:, 'ratings'] = testing_data_clean['ratings'].fillna(training_data_clean['ratings'].median())
testing_data_clean.loc[:, 'no_of_ratings'] = testing_data_clean['no_of_ratings'].fillna(0)

######  DISCOUNT PERCENT ############

# Create the 'discount_percent' feature, in training & testing
training_data_clean.loc[:, 'discount_percent'] = 100 * (training_data_clean['actual_price_usd'] - training_data_clean['discount_price_usd']) / training_data_clean['actual_price_usd']
testing_data_clean.loc[:, 'discount_percent'] = 100 * (testing_data_clean['actual_price_usd'] - testing_data_clean['discount_price_usd']) / testing_data_clean['actual_price_usd']

######  DISCOUNT AMOUNT ############
# Create the 'discount_amount' feature, in training & testing
training_data_clean.loc[:, 'discount_amount'] = training_data_clean['actual_price_usd'] - training_data_clean['discount_price_usd']
testing_data_clean.loc[:, 'discount_amount'] = testing_data_clean['actual_price_usd'] - testing_data_clean['discount_price_usd']

######  WEIGHTED RATING ############
# Define a constant 'a' for weighted_rating (you can tune this later)
a = 0.1

# Create the 'weighted_rating' feature (x + a*y^2), in training & testing
training_data_clean.loc[:, 'weighted_rating'] = training_data_clean['ratings'] + a * (training_data_clean['no_of_ratings'] ** 2)
testing_data_clean.loc[:, 'weighted_rating'] = testing_data_clean['ratings'] + a * (testing_data_clean['no_of_ratings'] ** 2)

print(training_data_clean[['discount_amount', 'weighted_rating']].head())
#print(testing_data_clean[['discount_amount', 'weighted_rating']].head())


   discount_amount  weighted_rating
0         0.152195              6.6
2         0.005955              3.9
3         0.527068         606640.7
4         1.927801              5.4
5         2.703109              3.9


# TRAINING THE MODEL

Now it is time to train the model! I select the features which I already cleaned from the training data. I have created 2 variables, X and y:

*  X is the feature matrix, which contains all of the features, as they were selected from the training data. This works like a 'checklist' to see which features we want to use together.
*  y has all of the values of the 'purchase?' column, encoded as 1 for 'YES' and 0 otherwise.

Then I used scikit-learn to split my data into training & validation sets. While testing, I tried many different values, but for submission, 80% of my data is for training and 20% is for validation. Another technique for evaluating a model is k-fold cross-validation, which rotates the validation set through k different slices of the data! This one reminds me of the process of kneading dough for a loaf of bread :)

Finally, I train the DecisionTreeClassifier from scikit-learn & generate my predictions.
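
For the curious, here is a minimal sketch of k-fold cross-validation with scikit-learn. It is not part of the submitted model: the data is synthetic, and `X_toy`, `y_toy`, and `model_cv` are names invented for this example. With `cv=5`, the data is cut into 5 folds and each fold takes one turn as the validation set.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 200 rows, 5 numeric features, binary label
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 5))
y_toy = (X_toy[:, 0] + 0.5 * X_toy[:, 1] > 0).astype(int)

model_cv = DecisionTreeClassifier(random_state=42)

# One accuracy score per fold; each fold is used exactly once for validation
scores = cross_val_score(model_cv, X_toy, y_toy, cv=5, scoring="accuracy")
print(f"Mean accuracy over 5 folds: {scores.mean():.3f} (std {scores.std():.3f})")
```

Averaging over several folds gives a steadier estimate than a single 80/20 split, at the cost of training the model k times.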


In [56]:
features = ['discount_percent', 'ratings', 'no_of_ratings', 'discount_amount', 'weighted_rating']

X = training_data_clean[features]

y = training_data_clean['purchase?'].apply(lambda x: 1 if x == 'YES' else 0)

In [61]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_val)
accuracy = accuracy_score(y_val, y_pred)
print(f"Model accuracy on validation data: {accuracy * 100:.2f}%")

Model accuracy on validation data: 93.27%


In [58]:
X_test = testing_data_clean[features]
test_predictions = model.predict(X_test)

# Map the 0/1 predictions back to 'NO'/'YES' labels in a single step
testing_data_clean['purchase?'] = np.where(test_predictions == 1, 'YES', 'NO')

testing_data_clean[['item_id', 'purchase?']].to_csv('/content/drive/My Drive/DSAssignment3/predicted_purchases2.csv', index=False)

--------------