### Import Cleaned E-commerce Dataset
The CSV file `cleaned_ecommerce_dataset.csv` is provided. Read it with the Pandas `read_csv` method, then print its total length (number of rows).

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np

# Load CSV file
data = pd.read_csv('C:/Users/Wassim/cleaned_ecommerce_dataset.csv') 

# Since we don't have a specified target, let's use 'rating' as a placeholder
features = data.drop('rating', axis=1) 
target = data['rating']

# Split the data

# Case 1: 10% Training, 90% Testing 
X_train_1, X_test_1, y_train_1, y_test_1 = train_test_split(
    features,  
    target,                
    test_size=0.9,          
    random_state=42           
)

print("Case 1:")
print(f"Training Set Shape: {X_train_1.shape}")
print(f"Testing Set Shape: {X_test_1.shape}")

# Case 2: 90% Training, 10% Testing
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(
    features, 
    target,            
    test_size=0.1,           
    random_state=42           
)

print("\nCase 2:")
print(f"Training Set Shape: {X_train_2.shape}")
print(f"Testing Set Shape: {X_test_2.shape}")


Case 1:
Training Set Shape: (268, 10)
Testing Set Shape: (2417, 10)

Case 2:
Training Set Shape: (2416, 10)
Testing Set Shape: (269, 10)


## Analysis of an E-commerce Dataset Part 2

The goal of the second analysis task is to train linear regression models to predict users' ratings of items. This involves a standard Data Science workflow: exploring data, building models, making predictions, and evaluating results. In this task, we will explore how feature selection and the size of the training/testing data affect model performance. We will use another cleaned, combined e-commerce sub-dataset that **is different from** the one in “Analysis of an E-commerce Dataset” task 1.

In [2]:
import pandas as pd

# Read the CSV file
data = pd.read_csv('C:/Users/Wassim/Downloads/cleaned_ecommerce_dataset.csv')

# Get the total length (number of rows)
total_length = len(data)

# Print the result
print("The total number of rows in the dataset is:", total_length)

The total number of rows in the dataset is: 2685


### Explore the Dataset

* Use the `head()` and `info()` methods to get a rough picture of the data, e.g., how many columns there are and the data type of each column.
* As our goal is to predict ratings from the other columns, compute the correlations between helpfulness/gender/category/review and rating using the `corr()` method.

  Hint: To compute correlations between features, you may first need to convert the categorical features (i.e., gender, category and review) into numerical values. To do this, you can import `OrdinalEncoder` from `sklearn.preprocessing` (see the useful examples [here](https://pbpython.com/categorical-encoding.html)).
* Please provide ___necessary explanations/analysis___ of the correlations, and identify the ___most___ and ___least___ correlated features with respect to rating. ___Discuss___ how the correlation will affect the final prediction results if we use these features to train a regression model for rating prediction. In what follows, we will conduct experiments to verify your hypothesis.
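
As a minimal sketch of the encoding step described in the hint, the snippet below converts the three text columns to numeric codes with `OrdinalEncoder` and then reads off each feature's correlation with `rating`. The small `sample` frame is invented for illustration (its values are not taken from the dataset); only the column names follow the task description.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical stand-in for the real dataset; values are invented for illustration
sample = pd.DataFrame({
    'helpfulness': [3.0, 4.0, 4.0, 3.0, 4.0, 3.0],
    'gender': ['M', 'F', 'M', 'F', 'F', 'M'],
    'category': ['Movies', 'Books', 'Movies', 'Games', 'Books', 'Games'],
    'review': ['good', 'bad', 'fine', 'great', 'poor', 'okay'],
    'rating': [4.0, 1.0, 5.0, 3.0, 2.0, 4.0],
})

categorical_cols = ['gender', 'category', 'review']

# Work on a copy so the original text columns stay available
encoded = sample.copy()
encoded[categorical_cols] = OrdinalEncoder().fit_transform(sample[categorical_cols])

# Correlation of every feature with the target, excluding the target itself
rating_corr = encoded.corr()['rating'].drop('rating')

# Rank by absolute value: a strong negative correlation is still informative
ranked = rating_corr.abs().sort_values(ascending=False)
print(ranked)
```

Dropping `rating` from the result matters because a column always correlates perfectly with itself, so it would otherwise be reported as the "most correlated feature".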

In [3]:
import pandas as pd


# View the first few rows
print(data.head())

# Summary of the data
print(data.info())



   userId  timestamp                                           review  \
0    4081      71900                                Not always McCrap   
1    4081      72000  I dropped the chalupa even before he told me to   
2    4081      72000                     The Wonderful World of Wendy   
3    4081     100399                             They actually did it   
4    4081     100399                             Hey! Gimme some pie!   

                                 item  rating  helpfulness gender  \
0                          McDonald's     4.0          3.0      M   
1                           Taco Bell     1.0          4.0      M   
2                             Wendy's     5.0          4.0      M   
3  South Park: Bigger, Longer & Uncut     5.0          3.0      M   
4                        American Pie     3.0          3.0      M   

                category  item_id  item_price  user_city  
0  Restaurants & Gourmet       41       30.74          4  
1  Restaurants & Gourmet    

In [4]:
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Create a sample DataFrame with categorical and numerical features
data = pd.DataFrame({
    'helpful': [3, 2, 5, 1, 4],
    'gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
    'category': ['Electronics', 'Clothing', 'Electronics', 'Clothing', 'Electronics'],
    'review': ['Positive', 'Negative', 'Positive', 'Negative', 'Neutral'],
    'rating': [4, 3, 5, 1, 4]
})

# Identify categorical columns
categorical_cols = ['gender', 'category', 'review']  

# Create an OrdinalEncoder instance
encoder = OrdinalEncoder()

# Fit the encoder to the categorical columns and transform  
data[categorical_cols] = encoder.fit_transform(data[categorical_cols])

# Calculate correlations
correlations = data.corr()

# Focus on correlations with the 'rating' column
rating_correlations = correlations['rating']

print(rating_correlations)

helpful     0.938315
gender      0.842701
category    0.842701
review      0.824226
rating      1.000000
Name: rating, dtype: float64


In [5]:
# Calculate correlations
correlations = data.corr()

# Focus on correlations with the 'rating' column
rating_correlations = correlations['rating']

print(rating_correlations)

# Exclude the target itself, which always correlates perfectly with itself
feature_correlations = rating_correlations.drop('rating')

# Most correlated feature
print("Most correlated feature:", feature_correlations.idxmax())

# Least correlated feature
print("Least correlated feature:", feature_correlations.idxmin())

helpful     0.938315
gender      0.842701
category    0.842701
review      0.824226
rating      1.000000
Name: rating, dtype: float64
Most correlated feature: helpful
Least correlated feature: review


### Split Training and Testing Data
* Machine learning models are trained to help make predictions for the future. Normally, we need to randomly split the dataset into training and testing sets, where we use the training set to train the model, and then leverage the well-trained model to make predictions on the testing set.
* To further investigate whether the size of the training/testing data affects model performance, randomly split the data into training and testing sets of different sizes:
    * Case 1: training data containing 10% of the entire data;
    * Case 2: training data containing 90% of the entire data.
* Print the shape of training and testing sets in the two cases.
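
As a quick arithmetic check on what shapes to expect, the sketch below computes the split sizes for the 2685 rows reported earlier. It assumes the test size is rounded up and the training set receives the remaining rows, which is how `train_test_split` behaves when only a fractional `test_size` is given; `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math

n_samples = 2685  # total number of rows reported earlier in this notebook

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test size is rounded up and the training set
    receives the remaining rows.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

print("Case 1:", split_sizes(n_samples, 0.9))  # (268, 2417)
print("Case 2:", split_sizes(n_samples, 0.1))  # (2416, 269)
```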

In [6]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset
df = pd.read_csv(r'C:\Users\Wassim\cleaned_ecommerce_dataset.csv')

# Assuming 'rating' is the target variable
X = df.drop('rating', axis=1)
y = df['rating']

# Case 1: Training data containing 10% of the entire data
X_train_case1, X_test_case1, y_train_case1, y_test_case1 = train_test_split(X, y, test_size=0.90, random_state=42)

# Case 2: Training data containing 90% of the entire data
X_train_case2, X_test_case2, y_train_case2, y_test_case2 = train_test_split(X, y, test_size=0.10, random_state=42)

# Print the shapes of training and testing sets for both cases
print("Case 1: Training data (10% of entire data)")
print("Training set shape:", X_train_case1.shape)
print("Testing set shape:", X_test_case1.shape)

print("\nCase 2: Training data (90% of entire data)")
print("Training set shape:", X_train_case2.shape)
print("Testing set shape:", X_test_case2.shape)


Case 1: Training data (10% of entire data)
Training set shape: (268, 10)
Testing set shape: (2417, 10)

Case 2: Training data (90% of entire data)
Training set shape: (2416, 10)
Testing set shape: (269, 10)


In [7]:
from sklearn.model_selection import train_test_split
import numpy as np

# Assuming you have your dataset stored in X (features) and y (target variable)

# Case 1: Training data containing 10% of the entire data
X_train_case1, X_test_case1, y_train_case1, y_test_case1 = train_test_split(X, y, test_size=0.90, random_state=42)

# Case 2: Training data containing 90% of the entire data
X_train_case2, X_test_case2, y_train_case2, y_test_case2 = train_test_split(X, y, test_size=0.10, random_state=42)

# Print the shape of training and testing sets for both cases
print("Case 1:")
print("Training set shape:", X_train_case1.shape, y_train_case1.shape)
print("Testing set shape:", X_test_case1.shape, y_test_case1.shape)
print("\nCase 2:")
print("Training set shape:", X_train_case2.shape, y_train_case2.shape)
print("Testing set shape:", X_test_case2.shape, y_test_case2.shape)


Case 1:
Training set shape: (268, 10) (268,)
Testing set shape: (2417, 10) (2417,)

Case 2:
Training set shape: (2416, 10) (2416,)
Testing set shape: (269, 10) (269,)


In [8]:
X_train_1.dtypes

userId           int64
timestamp        int64
review          object
item            object
helpfulness    float64
gender          object
category        object
item_id          int64
item_price     float64
user_city        int64
dtype: object

In [9]:
X_train_1.describe(include='all')

| | userId | timestamp | review | item | helpfulness | gender | category | item_id | item_price | user_city |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 268.0 | 268.0 | 268 | 268 | 268.0 | 268 | 268 | 268.0 | 268.0 | 268.0 |
| unique | | | 268 | 73 | | 2 | 9 | | | |
| top | | | Something was deep, but it wasn't the sea | All Advantage | | F | Movies | | | |
| freq | | | 1 | 11 | | 142 | 126 | | | |
| mean | 4735.865672 | 59059.708955 | | | 3.91791 | | | 45.660448 | 81.229813 | 21.130597 |
| std | 3502.071884 | 36710.153179 | | | 0.275015 | | | 26.42154 | 42.099419 | 11.527273 |
| min | 4.0 | 10100.0 | | | 3.0 | | | 0.0 | 12.0 | 0.0 |
| 25% | 1393.0 | 22100.25 | | | 4.0 | | | 23.5 | 49.0 | 11.0 |
| 50% | 4979.5 | 53000.5 | | | 4.0 | | | 46.5 | 69.0 | 23.0 |
| 75% | 7651.0 | 90500.0 | | | 4.0 | | | 68.25 | 126.5 | 31.0 |


In [10]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

# Step 1: Load Dataset
data = pd.read_csv('C:/Users/Wassim/Downloads/cleaned_ecommerce_dataset.csv')

# Step 2: Split  Data (Case 1: 10% Training, 90% testing) 
X_train_1, X_test_1, y_train_1, y_test_1 = train_test_split(
    data, data['rating'], test_size=0.9, random_state=42
)

# Step 3: Model Training and Evaluation (Case 1)
model_1 = LinearRegression()
model_1.fit(X_train_1, y_train_1)  
y_pred_1 = model_1.predict(X_test_1)

r2_1 = r2_score(y_test_1, y_pred_1)
mse_1 = mean_squared_error(y_test_1, y_pred_1)

print("Case 1 Results (10% Training Data):")
print('R-squared:', r2_1)
print('Mean Squared Error:', mse_1)

# Step 4: Split  Data (Case 2: 90% Training, 10% Testing) 
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(
    data, data['rating'], test_size=0.1, random_state=42
)

# Step 5: Model Training and Evaluation (Case 2)
model_2 = LinearRegression()
model_2.fit(X_train_2, y_train_2) 
y_pred_2 = model_2.predict(X_test_2)

r2_2 = r2_score(y_test_2, y_pred_2)
mse_2 = mean_squared_error(y_test_2, y_pred_2)

print("\nCase 2 Results (90% Training Data):")
print('R-squared:', r2_2)
print('Mean Squared Error:', mse_2)


ValueError: could not convert string to float: "Something was deep, but it wasn't the sea"
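
The error arises because `LinearRegression` needs a purely numeric feature matrix, while `data` still contains text columns such as `review`. Passing `data` itself as the features also leaves `rating` among the inputs, which would leak the target into the model. A minimal sketch of one way around both problems, using an invented `toy` frame rather than the real dataset, is to separate the target first and keep only numeric columns:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical miniature frame with the same kinds of columns; values are invented
toy = pd.DataFrame({
    'review': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'],
    'helpfulness': [3.0, 4.0, 4.0, 3.0, 4.0, 3.0, 4.0, 4.0, 3.0, 4.0],
    'item_price': [30.0, 12.0, 49.0, 69.0, 126.0, 81.0, 22.0, 55.0, 90.0, 40.0],
    'rating': [4.0, 1.0, 5.0, 5.0, 3.0, 2.0, 4.0, 5.0, 1.0, 3.0],
})

# Separate the target first so it cannot leak into the features
target = toy['rating']
features = toy.drop(columns='rating')

# Keep only numeric columns; text columns need encoding before they can be used
numeric_features = features.select_dtypes(include='number')

X_train, X_test, y_train, y_test = train_test_split(
    numeric_features, target, test_size=0.5, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
predictions = model.predict(X_test)
print(numeric_features.columns.tolist(), predictions.shape)
```

Text columns can still be used, but only after an encoding step such as `OrdinalEncoder` or `OneHotEncoder`.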

### Train Linear Regression Models with Feature Selection under Cases 1 & 2
* When training a machine learning model for prediction, we may need to select the most important/correlated input features for more accurate results.
* To investigate whether feature selection affects model performance, select the two most correlated and the two least correlated features (with respect to rating) from helpfulness/gender/category/review.
* Train four linear regression models under the following conditions:
    - (model-a) using the training/testing data in case 1 with the two most correlated input features
    - (model-b) using the training/testing data in case 1 with the two least correlated input features
    - (model-c) using the training/testing data in case 2 with the two most correlated input features
    - (model-d) using the training/testing data in case 2 with the two least correlated input features
* This lets us verify the impact of training/testing data size on model performance by comparing model-a with model-c (or model-b with model-d), and the impact of feature selection by comparing model-a with model-b (or model-c with model-d).

In [11]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

# Assuming 'rating' is the target variable
X = df.drop('rating', axis=1)
y = df['rating']

# Split the data into training and testing sets for both cases
X_train_case1, X_test_case1, y_train_case1, y_test_case1 = train_test_split(X, y, test_size=0.90, random_state=42)
X_train_case2, X_test_case2, y_train_case2, y_test_case2 = train_test_split(X, y, test_size=0.10, random_state=42)

# Define preprocessing steps for categorical, text, and numerical features
categorical_features = ['gender', 'category']
text_features = ['review']
numerical_features = ['helpfulness']

preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(), categorical_features),
        ('text', CountVectorizer(), 'review'),
        ('num', StandardScaler(), numerical_features)
    ])

# Define a function to train linear regression models
def train_linear_regression_model(X_train, X_test, y_train, y_test, features):
    # Preprocess the data
    X_train_processed = preprocessor.fit_transform(X_train)
    X_test_processed = preprocessor.transform(X_test)
    
    # Initialize and train the linear regression model
    model = LinearRegression()
    model.fit(X_train_processed, y_train)

    # Make predictions
    y_pred = model.predict(X_test_processed)

    # Calculate mean squared error
    mse = mean_squared_error(y_test, y_pred)

    return model, mse

# Train models for each case and feature selection
model_a_case1, mse_a_case1 = train_linear_regression_model(X_train_case1, X_test_case1, y_train_case1, y_test_case1, X_train_case1.columns)
model_b_case1, mse_b_case1 = train_linear_regression_model(X_train_case1, X_test_case1, y_train_case1, y_test_case1, ['gender', 'category'])
model_c_case2, mse_c_case2 = train_linear_regression_model(X_train_case2, X_test_case2, y_train_case2, y_test_case2, X_train_case2.columns)
model_d_case2, mse_d_case2 = train_linear_regression_model(X_train_case2, X_test_case2, y_train_case2, y_test_case2, ['gender', 'category'])

# Print mean squared errors for each model
print("Mean Squared Errors:")
print("Model A (Case 1, all features):", mse_a_case1)
print("Model B (Case 1, gender and category only):", mse_b_case1)
print("Model C (Case 2, all features):", mse_c_case2)
print("Model D (Case 2, gender and category only):", mse_d_case2)


Mean Squared Errors:
Model A (Case 1, all features): 2.3686227315749044
Model B (Case 1, gender and category only): 2.3686227315749044
Model C (Case 2, all features): 12.171902103846056
Model D (Case 2, gender and category only): 12.171902103846056


In [12]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Define the subsets of features for each case
features_case1_most_correlated = ['helpfulness', 'review']
features_case1_least_correlated = ['gender', 'category']
features_case2_most_correlated = ['helpfulness', 'review']
features_case2_least_correlated = ['gender', 'category']

# Preprocess categorical and text features separately
categorical_features = ['gender', 'category']
text_features = ['review']
numerical_features = ['helpfulness']

# Define preprocessing steps for categorical, text, and numerical features
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(), categorical_features),
        ('text', CountVectorizer(), 'review'),
        ('num', StandardScaler(), numerical_features)
    ])

# Define a function to train linear regression models
def train_linear_regression_model(X_train, X_test, y_train, y_test, features):
    # Initialize and train the linear regression model
    model = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('regressor', LinearRegression())
    ])
    model.fit(X_train[features], y_train)

    # Make predictions
    y_pred = model.predict(X_test[features])

    # Calculate mean squared error
    mse = mean_squared_error(y_test, y_pred)

    return model, mse

# Train models for each case and feature selection
model_a_case1, mse_a_case1 = train_linear_regression_model(X_train_case1, X_test_case1, y_train_case1, y_test_case1, features_case1_most_correlated)
model_b_case1, mse_b_case1 = train_linear_regression_model(X_train_case1, X_test_case1, y_train_case1, y_test_case1, features_case1_least_correlated)
model_c_case2, mse_c_case2 = train_linear_regression_model(X_train_case2, X_test_case2, y_train_case2, y_test_case2, features_case2_most_correlated)
model_d_case2, mse_d_case2 = train_linear_regression_model(X_train_case2, X_test_case2, y_train_case2, y_test_case2, features_case2_least_correlated)

# Print mean squared errors for each model
print("Mean Squared Errors:")
print("Model A (Case 1, most correlated features):", mse_a_case1)
print("Model B (Case 1, least correlated features):", mse_b_case1)
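print("Model C (Case 2, most correlated features):", mse_c_case2)
print("Model D (Case 2, least correlated features):", mse_d_case2)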


ValueError: A given column is not a column of the dataframe
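
This error appears because the shared `preprocessor` names all four columns, while `X_train[features]` contains only two of them. One possible fix, sketched below on invented data, is to build the `ColumnTransformer` from just the columns present in each feature subset. `build_pipeline` is a hypothetical helper name, and the feature lists are the ones used in the cell above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def build_pipeline(features):
    """Hypothetical helper: preprocess only the columns named in `features`."""
    transformers = []
    categorical = [c for c in features if c in ('gender', 'category')]
    if categorical:
        transformers.append(('cat', OneHotEncoder(handle_unknown='ignore'), categorical))
    if 'review' in features:
        # A bare string (not a list) hands CountVectorizer the 1-D text column it expects
        transformers.append(('text', CountVectorizer(), 'review'))
    if 'helpfulness' in features:
        transformers.append(('num', StandardScaler(), ['helpfulness']))
    return Pipeline(steps=[
        ('preprocessor', ColumnTransformer(transformers=transformers)),
        ('regressor', LinearRegression()),
    ])

# Invented rows, used only to show that both subsets now fit without error
toy = pd.DataFrame({
    'helpfulness': [3.0, 4.0, 4.0, 3.0, 4.0, 3.0],
    'gender': ['M', 'F', 'M', 'F', 'F', 'M'],
    'category': ['Movies', 'Books', 'Movies', 'Games', 'Books', 'Games'],
    'review': ['good film', 'bad book', 'fine film', 'great game', 'poor book', 'okay game'],
})
ratings = pd.Series([4.0, 1.0, 5.0, 3.0, 2.0, 4.0])

most = ['helpfulness', 'review']
least = ['gender', 'category']
model_most = build_pipeline(most).fit(toy[most], ratings)
model_least = build_pipeline(least).fit(toy[least], ratings)
pred_most = model_most.predict(toy[most])
pred_least = model_least.predict(toy[least])
print(pred_most.shape, pred_least.shape)
```

Creating a fresh pipeline per model also avoids sharing one fitted preprocessor across models, since each call returns new, unfitted objects.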

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

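# Calculate correlations between the numeric features and the target variable
# (numeric_only=True skips text columns such as 'review', which corr() cannot handle)
correlations = df.corr(numeric_only=True)['rating'].drop('rating')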

# Get two most correlated features
features_most_correlated = correlations.abs().nlargest(2).index.tolist()

# Get two least correlated features
features_least_correlated = correlations.abs().nsmallest(2).index.tolist()

# Select features for each case
features_case1_most_correlated = features_most_correlated
features_case1_least_correlated = features_least_correlated
features_case2_most_correlated = features_most_correlated
features_case2_least_correlated = features_least_correlated

# Define a function to train linear regression models
def train_linear_regression_model(X_train, X_test, y_train, y_test, features):
    # Select subset of features
    X_train_subset = X_train[features]
    X_test_subset = X_test[features]

    # Initialize and train the linear regression model
    model = LinearRegression()
    model.fit(X_train_subset, y_train)

    # Make predictions
    y_pred = model.predict(X_test_subset)

    # Calculate mean squared error
    mse = mean_squared_error(y_test, y_pred)

    return model, mse

# Train models for each case and feature selection
model_a_case1, mse_a_case1 = train_linear_regression_model(X_train_case1, X_test_case1, y_train_case1, y_test_case1, features_case1_most_correlated)
model_b_case1, mse_b_case1 = train_linear_regression_model(X_train_case1, X_test_case1, y_train_case1, y_test_case1, features_case1_least_correlated)
model_c_case2, mse_c_case2 = train_linear_regression_model(X_train_case2, X_test_case2, y_train_case2, y_test_case2, features_case2_most_correlated)
model_d_case2, mse_d_case2 = train_linear_regression_model(X_train_case2, X_test_case2, y_train_case2, y_test_case2, features_case2_least_correlated)

# Print mean squared errors for each model
print("Mean Squared Errors:")
print("Model A (Case 1, most correlated features):", mse_a_case1)
print("Model B (Case 1, least correlated features):", mse_b_case1)
print("Model C (Case 2, most correlated features):", mse_c_case2)
print("Model D (Case 2, least correlated features):", mse_d_case2)


In [None]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Define the subsets of features for each case
features_case1_most_correlated = ['helpfulness', 'review']
features_case1_least_correlated = ['gender', 'category']
features_case2_most_correlated = ['helpfulness', 'review']
features_case2_least_correlated = ['gender', 'category']

# Build an unfitted pipeline that preprocesses and uses only the given feature columns
def build_model(features):
    transformers = []
    categorical = [c for c in features if c in ('gender', 'category')]
    if categorical:
        # Ignore categories that appear only in the test set
        transformers.append(('cat', OneHotEncoder(handle_unknown='ignore'), categorical))
    if 'review' in features:
        transformers.append(('text', CountVectorizer(), 'review'))
    if 'helpfulness' in features:
        transformers.append(('num', 'passthrough', ['helpfulness']))
    # Columns not listed above are dropped (remainder='drop' is the default)
    return Pipeline(steps=[
        ('preprocessor', ColumnTransformer(transformers=transformers)),
        ('regressor', LinearRegression())
    ])

# Train linear regression models, one per case and feature subset
model_a_case1 = build_model(features_case1_most_correlated).fit(X_train_case1, y_train_case1)
model_b_case1 = build_model(features_case1_least_correlated).fit(X_train_case1, y_train_case1)
model_c_case2 = build_model(features_case2_most_correlated).fit(X_train_case2, y_train_case2)
model_d_case2 = build_model(features_case2_least_correlated).fit(X_train_case2, y_train_case2)

# Make predictions and evaluate the models
mse_a_case1 = mean_squared_error(y_test_case1, model_a_case1.predict(X_test_case1))
mse_b_case1 = mean_squared_error(y_test_case1, model_b_case1.predict(X_test_case1))
mse_c_case2 = mean_squared_error(y_test_case2, model_c_case2.predict(X_test_case2))
mse_d_case2 = mean_squared_error(y_test_case2, model_d_case2.predict(X_test_case2))

# Print mean squared errors for each model
print("Mean Squared Errors:")
print("Model A (Case 1, most correlated features):", mse_a_case1)
print("Model B (Case 1, least correlated features):", mse_b_case1)
print("Model C (Case 2, most correlated features):", mse_c_case2)
print("Model D (Case 2, least correlated features):", mse_d_case2)



In [None]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Define the subsets of features for each case
features_case1_most_correlated = ['helpfulness', 'review']
features_case1_least_correlated = ['gender', 'category']
features_case2_most_correlated = ['helpfulness', 'review']
features_case2_least_correlated = ['gender', 'category']

# Preprocess categorical and numerical features separately
categorical_features = ['gender', 'category', 'review']
numerical_features = ['helpfulness']

# Define preprocessing steps for categorical and numerical features
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

numerical_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

# Build a pipeline whose preprocessing covers only the requested feature columns
def build_model(features):
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numerical_transformer, [c for c in features if c in numerical_features]),
            ('cat', categorical_transformer, [c for c in features if c in categorical_features])
        ])
    return Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('regressor', LinearRegression())
    ])

# Train linear regression models, one per case and feature subset
model_a_case1 = build_model(features_case1_most_correlated).fit(X_train_case1, y_train_case1)
model_b_case1 = build_model(features_case1_least_correlated).fit(X_train_case1, y_train_case1)
model_c_case2 = build_model(features_case2_most_correlated).fit(X_train_case2, y_train_case2)
model_d_case2 = build_model(features_case2_least_correlated).fit(X_train_case2, y_train_case2)

# Make predictions and evaluate the models
mse_a_case1 = mean_squared_error(y_test_case1, model_a_case1.predict(X_test_case1))
mse_b_case1 = mean_squared_error(y_test_case1, model_b_case1.predict(X_test_case1))
mse_c_case2 = mean_squared_error(y_test_case2, model_c_case2.predict(X_test_case2))
mse_d_case2 = mean_squared_error(y_test_case2, model_d_case2.predict(X_test_case2))

# Print mean squared errors for each model
print("Mean Squared Errors:")
print("Model A (Case 1, most correlated features):", mse_a_case1)
print("Model B (Case 1, least correlated features):", mse_b_case1)
print("Model C (Case 2, most correlated features):", mse_c_case2)
print("Model D (Case 2, least correlated features):", mse_d_case2)


In [None]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Define the subsets of features for each case
features_case1_most_correlated = ['helpfulness', 'review']
features_case1_least_correlated = ['gender', 'category']
features_case2_most_correlated = ['helpfulness', 'review']
features_case2_least_correlated = ['gender', 'category']

# One-hot encode the text columns of a feature subset; other columns pass through unchanged
def encode_subset(X_train, X_test, features):
    to_encode = [c for c in features if c in ('gender', 'category', 'review')]
    ct = ColumnTransformer(
        [('encoder', OneHotEncoder(handle_unknown='ignore'), to_encode)],
        remainder='passthrough')
    # Fit on the training rows only, then apply the same encoding to the test rows
    return ct.fit_transform(X_train[features]), ct.transform(X_test[features])

X_train_a, X_test_a = encode_subset(X_train_case1, X_test_case1, features_case1_most_correlated)
X_train_b, X_test_b = encode_subset(X_train_case1, X_test_case1, features_case1_least_correlated)
X_train_c, X_test_c = encode_subset(X_train_case2, X_test_case2, features_case2_most_correlated)
X_train_d, X_test_d = encode_subset(X_train_case2, X_test_case2, features_case2_least_correlated)

# Train linear regression models
model_a_case1 = LinearRegression().fit(X_train_a, y_train_case1)
model_b_case1 = LinearRegression().fit(X_train_b, y_train_case1)
model_c_case2 = LinearRegression().fit(X_train_c, y_train_case2)
model_d_case2 = LinearRegression().fit(X_train_d, y_train_case2)

# Make predictions and evaluate the models
mse_a_case1 = mean_squared_error(y_test_case1, model_a_case1.predict(X_test_a))
mse_b_case1 = mean_squared_error(y_test_case1, model_b_case1.predict(X_test_b))
mse_c_case2 = mean_squared_error(y_test_case2, model_c_case2.predict(X_test_c))
mse_d_case2 = mean_squared_error(y_test_case2, model_d_case2.predict(X_test_d))

# Print mean squared errors for each model
print("Mean Squared Errors:")
print("Model A (Case 1, most correlated features):", mse_a_case1)
print("Model B (Case 1, least correlated features):", mse_b_case1)
print("Model C (Case 2, most correlated features):", mse_c_case2)
print("Model D (Case 2, least correlated features):", mse_d_case2)


In [None]:
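from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import OrdinalEncoder

# Define the subsets of features for each case
features_case1_most_correlated = ['helpfulness', 'review']
features_case1_least_correlated = ['gender', 'category']
features_case2_most_correlated = ['helpfulness', 'review']
features_case2_least_correlated = ['gender', 'category']

# Train linear regression models
def train_linear_regression_model(X_train, X_test, y_train, y_test, features):
    # Select the subset of features (copies, so the original frames stay untouched)
    X_train_subset = X_train[features].copy()
    X_test_subset = X_test[features].copy()

    # LinearRegression needs numeric input: encode text columns as integer codes.
    # Values unseen during fitting (e.g. new review texts) are mapped to -1.
    text_cols = X_train_subset.select_dtypes(exclude='number').columns.tolist()
    if text_cols:
        encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
        X_train_subset[text_cols] = encoder.fit_transform(X_train_subset[text_cols])
        X_test_subset[text_cols] = encoder.transform(X_test_subset[text_cols])

    # Initialize and train the linear regression model
    model = LinearRegression()
    model.fit(X_train_subset, y_train)

    # Make predictions
    y_pred = model.predict(X_test_subset)

    # Evaluate the model
    mse = mean_squared_error(y_test, y_pred)

    return model, mse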

# Train models for each case and feature selection
model_a_case1, mse_a_case1 = train_linear_regression_model(X_train_case1, X_test_case1, y_train_case1, y_test_case1, features_case1_most_correlated)
model_b_case1, mse_b_case1 = train_linear_regression_model(X_train_case1, X_test_case1, y_train_case1, y_test_case1, features_case1_least_correlated)
model_c_case2, mse_c_case2 = train_linear_regression_model(X_train_case2, X_test_case2, y_train_case2, y_test_case2, features_case2_most_correlated)
model_d_case2, mse_d_case2 = train_linear_regression_model(X_train_case2, X_test_case2, y_train_case2, y_test_case2, features_case2_least_correlated)

# Print mean squared errors for each model
print("Mean Squared Errors:")
print("Model A (Case 1, most correlated features):", mse_a_case1)
print("Model B (Case 1, least correlated features):", mse_b_case1)
print("Model C (Case 2, most correlated features):", mse_c_case2)
print("Model D (Case 2, least correlated features):", mse_d_case2)


In [13]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Function to select two most correlated features and two least correlated features
def select_features(features):
    # Implement feature selection logic here
    return most_correlated_features, least_correlated_features

# Function to split the dataset into training and testing data for case 1 and case 2
def split_data(dataset):
    # Implement data splitting logic for case 1 and case 2 here
    return case1_train_X, case1_test_X, case1_train_y, case1_test_y, case2_train_X, case2_test_X, case2_train_y, case2_test_y

# Function to train linear regression model
def train_model(train_X, train_y):
    model = LinearRegression()
    model.fit(train_X, train_y)
    return model

# Function to evaluate model performance
def evaluate_model(model, test_X, test_y):
    # Implement evaluation logic here
    return model_performance

# Load dataset
dataset = pd.read_csv('C:/Users/Wassim/cleaned_ecommerce_dataset.csv')

# Select features
most_correlated_features, least_correlated_features = select_features(dataset.features)

# Split data
case1_train_X, case1_test_X, case1_train_y, case1_test_y, case2_train_X, case2_test_X, case2_train_y, case2_test_y = split_data(dataset)

# Train models
model_a = train_model(case1_train_X[most_correlated_features], case1_train_y)
model_b = train_model(case1_train_X[least_correlated_features], case1_train_y)
model_c = train_model(case2_train_X[most_correlated_features], case2_train_y)
model_d = train_model(case2_train_X[least_correlated_features], case2_train_y)

# Evaluate models
performance_a = evaluate_model(model_a, case1_test_X[most_correlated_features], case1_test_y)
performance_b = evaluate_model(model_b, case1_test_X[least_correlated_features], case1_test_y)
performance_c = evaluate_model(model_c, case2_test_X[most_correlated_features], case2_test_y)
performance_d = evaluate_model(model_d, case2_test_X[least_correlated_features], case2_test_y)

# Print model performances
print("Model-a Performance:", performance_a)
print("Model-b Performance:", performance_b)
print("Model-c Performance:", performance_c)
print("Model-d Performance:", performance_d)


AttributeError: 'DataFrame' object has no attribute 'features'
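
The `AttributeError` is raised because a pandas `DataFrame` has no `features` attribute; the column names are available through `dataset.columns`. The helper functions in this cell are also still placeholders. As one possible way to fill in `evaluate_model`, the sketch below reports MSE, RMSE and R² on invented, perfectly linear data. The choice of metrics is an assumption, not something the task prescribes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

def evaluate_model(model, test_X, test_y):
    """Return common regression metrics for a fitted model."""
    predictions = model.predict(test_X)
    mse = mean_squared_error(test_y, predictions)
    return {
        'mse': mse,
        'rmse': float(np.sqrt(mse)),  # same units as the target
        'r2': r2_score(test_y, predictions),
    }

# Invented, perfectly linear data: y = 2x + 1
train_X = np.array([[0.0], [1.0], [2.0], [3.0]])
train_y = np.array([1.0, 3.0, 5.0, 7.0])
test_X = np.array([[4.0], [5.0]])
test_y = np.array([9.0, 11.0])

model = LinearRegression().fit(train_X, train_y)
performance = evaluate_model(model, test_X, test_y)
print(performance)
```

Because the toy data lie exactly on a line, the error should be essentially zero and R² essentially one; on the real ratings data the values will of course differ.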

In [14]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

# Load your dataset
df = pd.read_csv('C:/Users/Wassim/cleaned_ecommerce_dataset.csv')

# Drop rows with missing values (dropna returns a new frame, so keep the result)
df = df.dropna()

# Calculate correlations
corr_matrix = df.corr()

# Identify most and least correlated features (adjust column names)
most_corr_features = corr_matrix['rating'].nlargest(3)[1:3].index.tolist() 
least_corr_features = corr_matrix['rating'].nsmallest(3)[:-1].index.tolist()

# Case 1 Data
data_case_1 = df[['rating'] + most_corr_features]

# Case 2 Data
data_case_2 = df[['rating'] + least_corr_features]

# Function for model training and evaluation
def train_and_evaluate(data, model_name):
    X = data.drop('rating', axis=1)
    y = data['rating']

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 

    model = LinearRegression()
    model.fit(X_train, y_train)  

    y_pred = model.predict(X_test)

    print(f'Model: {model_name}')
    print('R-squared:', r2_score(y_test, y_pred))
    print('Mean Squared Error:', mean_squared_error(y_test, y_pred))
    print('-------------------')

# Train the models
train_and_evaluate(data_case_1, 'model-a')
train_and_evaluate(data_case_1, 'model-b') 
train_and_evaluate(data_case_2, 'model-c') 
train_and_evaluate(data_case_2, 'model-d') 


ValueError: could not convert string to float: 'Not always McCrap'
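
In recent pandas versions, `DataFrame.corr()` raises `ValueError: could not convert string to float` when the frame still holds free-text columns such as `review`. The sketch below shows one possible workaround on a small made-up frame: restrict the correlation to numeric columns with `numeric_only=True`, then rank features by absolute correlation with `rating`. The frame `toy` and its values are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy rows for illustration only; not taken from the real dataset
toy = pd.DataFrame({
    "review": ["Great value", "Decent", "Too slow", "Not for me", "Loved it"],
    "helpfulness": [4, 3, 1, 2, 5],
    "item_price": [30.0, 10.0, 20.0, 50.0, 40.0],
    "rating": [4, 3, 1, 2, 5],
})

# numeric_only=True skips the text column instead of trying to cast it to float
corr_with_rating = toy.corr(numeric_only=True)["rating"].drop("rating")

# Rank by absolute value so strong negative correlations still count as "correlated"
ranked = corr_with_rating.abs().sort_values(ascending=False)
most_correlated = ranked.index[0]
least_correlated = ranked.index[-1]

print("Most correlated:", most_correlated)
print("Least correlated:", least_correlated)
```

Dropping `rating` from its own correlation column matters: without it, `rating` (correlation 1.0 with itself) would always be picked as the top feature.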

In [None]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# ---------------------- Step 1: Data Preprocessing ----------------------
# Load your dataset
data = pd.read_csv('C:/Users/Wassim/cleaned_ecommerce_dataset.csv')

# Handle missing values (adjust as needed)
data.dropna(inplace=True)  # Example: remove rows with missing values

# Encode categorical features (if necessary)
data = pd.get_dummies(data, columns=['gender', 'category']) 

# ---------------------- Step 2: Correlation Analysis ----------------------
# numeric_only=True skips text columns such as 'review'
corr_matrix = data.corr(numeric_only=True)
corr_with_rating = corr_matrix['rating'].drop('rating')  # exclude the target itself

# Rank by absolute correlation so strongly negative features are not treated as "least correlated"
ranked = corr_with_rating.abs().sort_values(ascending=False)
most_corr_features = ranked.index[:2].tolist()    # Top 2
least_corr_features = ranked.index[-2:].tolist()  # Bottom 2

# ---------------------- Step 3: Feature Selection ----------------------
datasets = {
    'A': data[['rating'] + most_corr_features],
    'B': data[['rating'] + least_corr_features],
    'C': data[['rating'] + most_corr_features],  # Same as A in this example, adjust for Case 2
    'D': data[['rating'] + least_corr_features]  # Same as B in this example, adjust for Case 2
}

# ---------------------- Step 4: Model Building ----------------------
# Test sizes follow the two cases defined earlier in the notebook:
# Case 1 trains on 10% of the data, Case 2 trains on 90% (adjust as needed)
test_size_case_1 = 0.9
test_size_case_2 = 0.1

models = {}
for dataset_name, dataset in datasets.items():
    # Models A and B use the Case 1 split; C and D use the Case 2 split
    if dataset_name in ['A', 'B']:
        test_size = test_size_case_1
    else:
        test_size = test_size_case_2

    X = dataset.drop('rating', axis=1)
    y = dataset['rating']

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=42)

    model = LinearRegression()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    models[dataset_name] = {
        'model': model,
        'r2': r2_score(y_test, y_pred),
        'mse': mean_squared_error(y_test, y_pred),
        'mae': mean_absolute_error(y_test, y_pred)
    }

# ---------------------- Step 5: Evaluation ----------------------
print("Model Performance:")
for dataset_name, metrics in models.items():
    print(f"\nModel {dataset_name}:")
    print(f"  R-squared: {metrics['r2']}")
    print(f"  MSE: {metrics['mse']}")
    print(f"  MAE: {metrics['mae']}")


In [15]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Load your dataset
df = pd.read_csv("C:/Users/Wassim/cleaned_ecommerce_dataset.csv")

# Preprocessing 
df = df.dropna() 

# Sentiment Analysis
def get_sentiment(text):
    analyzer = SentimentIntensityAnalyzer()
    return analyzer.polarity_scores(text)['compound']  

df['sentiment'] = df['review'].apply(get_sentiment)

# Correlation Analysis (numeric columns only; text columns such as 'review' are skipped)
corr_with_rating = df.corr(numeric_only=True)['rating'].drop('rating')
ranked = corr_with_rating.abs().sort_values(ascending=False)
most_corr_features = ranked.index[:2].tolist()
least_corr_features = ranked.index[-2:].tolist()

# Function for model training and evaluation
def train_and_evaluate(features, X_train, X_test, y_train, y_test):
    # Fit and score on the selected feature columns only
    model = LinearRegression().fit(X_train[features], y_train)
    predictions = model.predict(X_test[features])
    r2 = r2_score(y_test, predictions)
    mse = mean_squared_error(y_test, predictions)
    return r2, mse

# ------ CASE 1: 80/20 SPLIT --------
# Keep every numeric column (including 'sentiment') except the target,
# so any selected feature is available in X
X = df.select_dtypes(include='number').drop(columns='rating')
y = df['rating']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Model-a: Most correlated 
r2_a, mse_a = train_and_evaluate(most_corr_features, X_train, X_test, y_train, y_test)

# Model-b: Least correlated
r2_b, mse_b = train_and_evaluate(least_corr_features, X_train, X_test, y_train, y_test)


# ------ CASE 2: 90/10 SPLIT --------
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42) 

# Model-c: Most correlated 
r2_c, mse_c = train_and_evaluate(most_corr_features, X_train, X_test, y_train, y_test)

# Model-d: Least correlated
r2_d, mse_d = train_and_evaluate(least_corr_features, X_train, X_test, y_train, y_test)

# -------- Results --------
print("CASE 1:")
print("Model-a (Most Correlated): R2 =", r2_a, "MSE =", mse_a)
print("Model-b (Least Correlated): R2 =", r2_b, "MSE =", mse_b)

print("CASE 2:")
print("Model-c (Most Correlated): R2 =", r2_c, "MSE =", mse_c)
print("Model-d (Least Correlated): R2 =", r2_d, "MSE =", mse_d)

In [None]:
import nltk
nltk.download('punkt') # A crucial corpus for TextBlob

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from textblob import TextBlob  # Example sentiment analysis library

# ---------------------- Step 1: Data Preprocessing ----------------------
# Load your dataset (replace placeholder)
data = pd.read_csv('your_data.csv')

# Handle missing values (adjust as needed)
data.dropna(inplace=True)

# Encode categorical features (if necessary)
data = pd.get_dummies(data, columns=['gender', 'category']) 

# ---------------------- Step 2: Sentiment Analysis  ----------------------
def get_sentiment(review):
    """Calculates sentiment polarity using TextBlob"""
    analysis = TextBlob(review)
    return analysis.sentiment.polarity  # Polarity ranges from -1 (negative) to 1 (positive) 

# Apply Sentiment Analysis
data['sentiment'] = data['review'].apply(get_sentiment)

# ---------------------- Step 3: Correlation Analysis ----------------------
# Select appropriate features (adjust as needed)
features = ['rating', 'helpfulness', 'sentiment']  
corr_matrix = data[features].corr()

# ... (Rest of your code to select features, build models, and evaluate) ...  


In [None]:
df = pd.read_csv("C:/Users/Wassim/cleaned_ecommerce_dataset.csv")
df.head(10)  # Show the first 10 rows

In [None]:
import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Load your dataset
df = pd.read_csv("C:/Users/Wassim/cleaned_ecommerce_dataset.csv")

# Sentiment Analysis Function
def get_sentiment(text):
    analyzer = SentimentIntensityAnalyzer()
    return analyzer.polarity_scores(text)['compound']  

# Add a sentiment column
df['sentiment'] = df['review'].apply(get_sentiment)

# Feature Selection - Update your feature set
X = df[['helpfulness', 'gender', 'category', 'sentiment']] 
y = df['rating']
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

# Load your CSV dataset
df = pd.read_csv("C:/Users/Wassim/cleaned_ecommerce_dataset.csv")

# Preprocessing (Illustrative - adapt as needed)
df = df.dropna()  
df = df[(np.abs(df['rating'] - df['rating'].mean()) <= (3 * df['rating'].std()))] 

# Correlation Analysis (numeric columns only; text columns such as 'review' are skipped)
corr_with_rating = df.corr(numeric_only=True)['rating'].drop('rating')

# Rank by absolute correlation so strongly negative features are not treated as "least correlated"
ranked = corr_with_rating.abs().sort_values(ascending=False)
most_correlated_features = ranked.index[:2].tolist()
least_correlated_features = ranked.index[-2:].tolist()

print("Most Correlated Features:", most_correlated_features)
print("Least Correlated Features:", least_correlated_features)

# Feature Selection and Model Building
def train_and_evaluate(features, X_train, X_test, y_train, y_test):
    # Fit and score on the selected feature columns only
    model = LinearRegression().fit(X_train[features], y_train)
    predictions = model.predict(X_test[features])

    r2 = r2_score(y_test, predictions)
    mse = mean_squared_error(y_test, predictions)

    return r2, mse

# Case 1: 80/20 split
# Keep every numeric column except the target so any selected feature is available
X = df.select_dtypes(include='number').drop(columns='rating')
y = df['rating']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Model-a: Most correlated features
r2_a, mse_a = train_and_evaluate(most_correlated_features, X_train, X_test, y_train, y_test)

# Model-b: Least correlated features
r2_b, mse_b = train_and_evaluate(least_correlated_features, X_train, X_test, y_train, y_test)

# Case 2: 90/10 split 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42) 

# Model-c: Most correlated features
r2_c, mse_c = train_and_evaluate(most_correlated_features, X_train, X_test, y_train, y_test)

# Model-d: Least correlated features
r2_d, mse_d = train_and_evaluate(least_correlated_features, X_train, X_test, y_train, y_test)

# Results and Analysis
print("-------- Model Performance ---------")
print("Case 1:")
print("Model-a (Most Correlated): R2 =", r2_a, "MSE =", mse_a)
print("Model-b (Least Correlated): R2 =", r2_b, "MSE =", mse_b)
print("Case 2:")
print("Model-c (Most Correlated): R2 =", r2_c, "MSE =", mse_c)
print("Model-d (Least Correlated): R2 =", r2_d, "MSE =", mse_d)


In [None]:
correlations = data.corr(numeric_only=True)  # skip text columns such as 'review'
rating_correlations = correlations['rating']

In [None]:
import nltk
nltk.download('vader_lexicon')

In [None]:
import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Load your dataset (same as before)
df = pd.read_csv("C:/Users/Wassim/cleaned_ecommerce_dataset.csv")

# --- Sentiment Analysis ---
def get_sentiment(text):
    analyzer = SentimentIntensityAnalyzer()  # Create a sentiment analyzer
    return analyzer.polarity_scores(text)['compound']  # Get sentiment score

df['sentiment'] = df['review'].apply(get_sentiment)  # Add 'sentiment' column

# --- Feature Selection ---
X = df[['helpfulness', 'gender', 'category', 'sentiment']]  # Use 'sentiment'
y = df['rating']

# --- Model Training and Analysis (Your existing code) ---
# ... (Train your models with the new 'sentiment' feature)


In [None]:
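# Rank features by absolute correlation with 'rating', excluding 'rating' itself
# (otherwise 'rating' is always the top match, with correlation 1.0)
ranked = rating_correlations.drop('rating').abs().sort_values(ascending=False)

# Two most correlated features
most_correlated_features = ranked.index[:2].tolist()

# Two least correlated features
least_correlated_features = ranked.index[-2:].tolist()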

print("Most correlated features:", most_correlated_features)
print("Least correlated features:", least_correlated_features)

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression  # Example regression model
from sklearn.metrics import r2_score, mean_squared_error

# Load your dataset (assuming you have it in a DataFrame named 'data')

# ... (Your code to load the dataset) ... 

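# Assuming the following (replace with the actual output from your previous step):
most_correlated_features = ['helpfulness', 'gender'] 
least_correlated_features = ['category', 'review'] 

# LinearRegression needs numeric inputs, so text columns are mapped to
# integer codes here (one possible encoding; adjust as needed)
for col in ['gender', 'category', 'review']:
    data[col] = data[col].astype('category').cat.codes

# Split into features (X) and target variable (y)
X = data[['helpfulness', 'gender', 'category', 'review']] 

y = data['rating']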

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 

# Model 1: Most correlated features
model_1 = LinearRegression()
model_1.fit(X_train[most_correlated_features], y_train)
y_pred_1 = model_1.predict(X_test[most_correlated_features])

# Model 2: Least correlated features
model_2 = LinearRegression()
model_2.fit(X_train[least_correlated_features], y_train)
y_pred_2 = model_2.predict(X_test[least_correlated_features])

# Evaluation
print('Model 1 (Most Correlated) R-squared:', r2_score(y_test, y_pred_1))
print('Model 2 (Least Correlated) R-squared:', r2_score(y_test, y_pred_2))

# You can also calculate other metrics like mean squared error (MSE)


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

# Step 1: Loading the Dataset
data = pd.read_csv('C:/Users/Wassim/Downloads/cleaned_ecommerce_dataset.csv')

# Step 2: Verify Data and Columns
print(data.head())  #  Inspect the first few rows
print(data.info())  #  Check data types and columns
print(data.shape)   #  Print the dataset's dimensions

# Step 3: Analyzing Correlations (numeric columns only)
correlations = data.corr(numeric_only=True)
rating_correlations = correlations['rating'].drop('rating')  # exclude the target itself

# Rank by absolute correlation with 'rating'
ranked = rating_correlations.abs().sort_values(ascending=False)
most_correlated_features = ranked.index[:2].tolist()
least_correlated_features = ranked.index[-2:].tolist()

print("Most correlated features:", most_correlated_features)
print("Least correlated features:", least_correlated_features)

# Step 4: Preparing Data 
# Keep every numeric column except the target so the selected features are available
X = data.select_dtypes(include='number').drop(columns='rating')
y = data['rating']

# Step 5: Splitting Data (Make sure you have enough data first)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 

# Step 6: Model Building and Evaluation
def train_and_evaluate(features):
    model = LinearRegression()
    model.fit(X_train[features], y_train)
    y_pred = model.predict(X_test[features])

    r2 = r2_score(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)

    print(f'Model using features: {features}')
    print('R-squared:', r2)
    print('Mean Squared Error:', mse)
    print('-------------------')

train_and_evaluate(most_correlated_features)
train_and_evaluate(least_correlated_features)

### Evaluate Models
* Evaluate the performance of the four models with two metrics, including MSE and Root MSE
* Print the results of the four models regarding the two metrics
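
The two metrics are directly related: RMSE is the square root of MSE, which puts the error back on the same scale as the rating. The sketch below computes both with NumPy on made-up numbers; `y_true` and `y_pred` are illustrative values, not results from the models in this notebook. `sklearn.metrics.mean_squared_error` returns the same MSE for the same inputs.

```python
import numpy as np

# Illustrative values only; not results from the notebook's models
y_true = np.array([5.0, 3.0, 4.0, 1.0])
y_pred = np.array([4.0, 3.0, 2.0, 2.0])

mse = np.mean((y_true - y_pred) ** 2)  # mean of the squared errors
rmse = np.sqrt(mse)                    # back on the same scale as the rating

print(f"MSE: {mse:.3f}")
print(f"RMSE: {rmse:.3f}")
```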

In [None]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from textblob import TextBlob
import numpy as np 

# ... (Your data loading, preprocessing, sentiment analysis code) ...

# ---------------------- Step 4: Model Building ----------------------
# ... (Your dataset creation code) ...

models = {}
for dataset_name, dataset in datasets.items():
    # ... (Your train_test_split code) ...

    model = LinearRegression()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    models[dataset_name] = {
        'model': model,
        'r2': r2_score(y_test, y_pred),
        'mse': mean_squared_error(y_test, y_pred),
        'rmse': np.sqrt(mean_squared_error(y_test, y_pred)),  # Calculate RMSE
        'mae': mean_absolute_error(y_test, y_pred)
    }

# ---------------------- Step 5: Evaluation ----------------------
print("Model Performance:")
for dataset_name, metrics in models.items():
    print(f"\nModel {dataset_name}:")
    print(f"  R-squared: {metrics['r2']}")
    print(f"  MSE: {metrics['mse']}")
    print(f"  RMSE: {metrics['rmse']}") 
    print(f"  MAE: {metrics['mae']}") 


### Visualize, Compare and Analyze the Results
* Visualize the results, and perform ___insightful analysis___ on the obtained results. For better visualization, you may need to carefully set the scale for the y-axis.
* Normally, the model trained with the most correlated features and more training data will get better results. Do you observe the same? If not, please ___explain the possible reasons___.
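
One way to approach the y-axis point is a bar chart whose y-range is zoomed around the observed values, so that small differences between models stay visible. The sketch below assumes matplotlib is installed. The values in `rmse_by_model` are placeholders invented for illustration and should be replaced with your own results.

```python
import matplotlib.pyplot as plt

# Placeholder RMSE values for illustration only; substitute your own results
rmse_by_model = {"model-a": 1.36, "model-b": 1.37, "model-c": 1.34, "model-d": 1.35}

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(list(rmse_by_model.keys()), list(rmse_by_model.values()))

# Zoom the y-axis around the observed range so small differences are visible
low = min(rmse_by_model.values())
high = max(rmse_by_model.values())
margin = 0.5 * (high - low)
ax.set_ylim(low - margin, high + margin)

ax.set_ylabel("RMSE (axis does not start at zero)")
ax.set_title("RMSE by model (placeholder values)")
```

In a notebook the figure renders at the end of the cell. A truncated axis exaggerates differences visually, so it is worth stating that in the label or caption.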

In [None]:
import pandas as pd

df = pd.read_csv('C:/Users/Wassim/Downloads/cleaned_ecommerce_dataset.csv')

# Potential Preprocessing
df.dropna(subset=['helpfulness', 'rating'], inplace=True)  # Drop rows with missing values 


In [None]:
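corr_matrix = df.corr(numeric_only=True)  # skip text columns such as 'review'

# Identify most/least correlated features with the 'rating' column,
# ranked by absolute correlation and excluding 'rating' itself
ranked = corr_matrix['rating'].drop('rating').abs().sort_values(ascending=False)
most_corr_features = ranked.index[:2].tolist() 
least_corr_features = ranked.index[-2:].tolist() 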


In [16]:
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X = [[1, 2], [3, 4], [5, 6]]  # Sample data
y = [0, 1, 1]  # Sample target labels

X_train, X_test, y_train, y_test = train_test_split(X, y)

# Fit a model (replace with your actual model)
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred) 
print(accuracy)


1.0


### Data Science Ethics
* Please read the following examples: [Click here to read example 1.](https://www.vox.com/covid-19-coronavirus-us-response-trump/2020/5/18/21262265/georgia-covid-19-cases-declining-reopening) [Click here to read example 2.](https://viborc.com/ethics-and-ethical-data-visualization-a-complete-guide/)

* Then view the picture: ![My Image](figure_portfolio2.png "This is my image")
Please compose an analysis of 100-200 words that evaluates potential ethical concerns associated with the infographic, detailing the reasons behind these issues.


In [None]:
Privacy:  How was the athlete data collected?  Without explicitly mentioning data collection methods, concerns linger about whether athletes provided informed consent regarding the use of their performance metrics.  Respecting athlete privacy is paramount, especially when data can be linked to individuals.

Bias:  The infographic lacks information about its data source. If the source itself is inherently biased (e.g., favoring certain nations or sports),  the presented medal counts might not offer an accurate or fair global representation.  Transparency about data origins is crucial for preventing misinterpretation.

Potential for Misuse:  Focusing purely on medal counts can be misleading.  Factors beyond athletic prowess—like a country's investment in sports infrastructure or focus on specific disciplines—can heavily influence the results. Using this data to fuel nationalistic narratives or make unfair comparisons would be an ethical misuse.

Responsible Data Science:  This infographic highlights the need for responsible data science practices.  Ethical considerations should encompass informed consent, addressing biases, anticipating potential misuse, and prioritizing transparency of data sources and methodologies.