# Vanessa Williams

# Assignment 6.2

# Load data

In [None]:
import pandas as pd

# Load the TSV file
file_path = '/Users/vanessawilliams/Desktop/Vanessa_Williams/amazon_alexa.tsv'
data = pd.read_csv(file_path, sep='\t')

# Check the first few rows
data.head()

|   | rating | date | variation | verified_reviews | feedback |
| - | ------ | ---- | --------- | ---------------- | -------- |
| 0 | 5 | 31-Jul-18 | Charcoal Fabric | Love my Echo! | 1 |
| 1 | 5 | 31-Jul-18 | Charcoal Fabric | Loved it! | 1 |
| 2 | 4 | 31-Jul-18 | Walnut Finish | Sometimes while playing a game, you can answer... | 1 |
| 3 | 5 | 31-Jul-18 | Charcoal Fabric | I have had a lot of fun with this thing. My 4 ... | 1 |
| 4 | 5 | 31-Jul-18 | Charcoal Fabric | Music | 1 |


# Preprocess the text data

In [None]:
import re

# Function to clean text, with a check for NaN values
def clean_text(text):
    if isinstance(text, str):  # Check if the value is a string
        text = text.lower()  # Convert to lowercase
        text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
        text = re.sub(r'\d+', '', text)  # Remove numbers
        return text
    return ''  # Non-string values (e.g. NaN) become an empty string

# Apply the cleaning function to the 'verified_reviews' column (NaN values become empty strings)
data['cleaned_reviews'] = data['verified_reviews'].apply(clean_text)

# Check the cleaned data
data[['verified_reviews', 'cleaned_reviews']].head()

|   | verified_reviews | cleaned_reviews |
| - | ---------------- | --------------- |
| 0 | Love my Echo! | love my echo |
| 1 | Loved it! | loved it |
| 2 | Sometimes while playing a game, you can answer... | sometimes while playing a game you can answer ... |
| 3 | I have had a lot of fun with this thing. My 4 ... | i have had a lot of fun with this thing my yr... |
| 4 | Music | music |
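
To make the effect of each cleaning step concrete, here is a small sketch that applies the same logic to a single string. The sample review is made up for illustration, and the final whitespace-collapsing step is an optional assumption that is not part of the notebook.

```python
import re

def clean_text(text):
    """Same logic as the notebook's clean_text, repeated so this runs alone."""
    if isinstance(text, str):
        text = text.lower()                  # lowercase
        text = re.sub(r'[^\w\s]', '', text)  # drop punctuation
        text = re.sub(r'\d+', '', text)      # drop digits
        return text
    return ''                                # NaN and other non-strings

sample = "My 4 yr old loves it!"   # made-up review, not from the dataset
cleaned = clean_text(sample)
print(repr(cleaned))               # 'my  yr old loves it' (note the double space)

missing = clean_text(float('nan'))
print(repr(missing))               # ''

# Optional extra step (an assumption, not in the notebook): collapse whitespace
collapsed = re.sub(r'\s+', ' ', cleaned).strip()
print(repr(collapsed))             # 'my yr old loves it'
```

Removing digits leaves a double space where the number used to be. This is harmless for the later TF-IDF step, which tokenizes on word boundaries, but collapsing whitespace keeps the cleaned column tidier.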


# Vectorizing the text data

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the TfidfVectorizer
vectorizer = TfidfVectorizer(max_features=500, stop_words='english')

# Fit and transform the cleaned reviews
X = vectorizer.fit_transform(data['cleaned_reviews'])

# Convert the resulting sparse matrix to a DataFrame for better readability
X_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())

# Display first few rows
X_df.head()

Unnamed: 0,ability,able,absolutely,access,account,actually,add,added,adding,addition,...,working,works,worth,wouldnt,wrong,year,years,yes,youre,youtube
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.235807,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.33587,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


# Build a classification model, split the data into training and testing sets

In [None]:
from sklearn.model_selection import train_test_split

# Define the target variable (feedback) and the feature matrix (X)
y = data['feedback']
X_train, X_test, y_train, y_test = train_test_split(X_df, y, test_size=0.2, random_state=42)

# Check the shape of the training and testing data
print(X_train.shape, X_test.shape)

(2520, 500) (630, 500)


# Logistic regression as our classifier

In [None]:
from sklearn.linear_model import LogisticRegression

# Initialize the Logistic Regression model
clf = LogisticRegression()

# Train the model on the training data
clf.fit(X_train, y_train)

# Make predictions on the test data
y_pred = clf.predict(X_test)

In [None]:
from sklearn.metrics import classification_report, accuracy_score

# Print the accuracy score
print("Accuracy:", accuracy_score(y_test, y_pred))

# Print the full classification report
print(classification_report(y_test, y_pred))

Accuracy: 0.9142857142857143
              precision    recall  f1-score   support

           0       1.00      0.07      0.13        58
           1       0.91      1.00      0.95       572

    accuracy                           0.91       630
   macro avg       0.96      0.53      0.54       630
weighted avg       0.92      0.91      0.88       630



# Text Classification on Amazon Alexa Reviews

## Overview
The goal of this project was to classify customer reviews of Amazon Alexa devices as positive or negative based on their text. The dataset contains the fields `rating`, `date`, `variation`, `verified_reviews`, and `feedback`; the analysis used the `verified_reviews` column as input and the `feedback` column as the label. We preprocessed the review text with basic natural language processing techniques and trained a logistic regression model to predict whether a review was positive (class 1) or negative (class 0).

## Steps

### 1. Data Preprocessing
We began by cleaning the text in the `verified_reviews` column. The preprocessing steps were:
- Converting all text to lowercase.
- Removing punctuation and special characters.
- Removing digits.
- Replacing non-string values (such as missing reviews) with an empty string.

Stop words such as "and", "the", and "is" were removed later, during vectorization (`stop_words='english'`). Words were not lemmatized, so related forms such as "works" and "working" remain separate features.

Here is a sample of the cleaned text:

| Original Review | Cleaned Review |
| --------------- | -------------- |
| "Love my Echo!" | "love my echo" |
| "Loved it!" | "loved it" |
| "Sometimes while playing a game, you can answer a question correctly but Alexa says you got it wrong and answers the same as you." | "sometimes while playing a game you can answer a question correctly but alexa says you got it wrong and answers the same as you" |

### 2. Vectorization
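After cleaning the text, we converted it into numerical features using TF-IDF vectorization. This produces a matrix in which each row represents a review and each column represents a term. The vocabulary was limited to 500 terms (`max_features=500`, which keeps the most frequent terms) after removing English stop words. Each value weights a term's frequency within a review against how common the term is across all reviews, so terms that appear in many reviews receive lower weights.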

### 3. Model Building
We built a Logistic Regression model to classify the reviews. The vectorized data was split 80/20 into a training set (2,520 reviews) and a test set (630 reviews), and the model was trained on the training set to predict whether a review was positive (class 1) or negative (class 0).

### 4. Model Evaluation
The model was evaluated using several metrics, including accuracy, precision, recall, and F1-score. Here's a summary of the results:

- **Accuracy**: 91.43%
- **Precision**:
    - Class 0: 1.00
    - Class 1: 0.91
- **Recall**:
    - Class 0: 0.07
    - Class 1: 1.00
- **F1-Score**:
    - Class 0: 0.13
    - Class 1: 0.95

| Class | Precision | Recall | F1-Score | Support |
|-------|-----------|--------|----------|---------|
| 0     | 1.00      | 0.07   | 0.13     | 58      |
| 1     | 0.91      | 1.00   | 0.95     | 572     |
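
The per-class figures above can be reproduced by hand. The sketch below uses confusion-matrix counts inferred from the reported support, precision, and recall (4 of the 58 negative reviews predicted as negative, and all 572 positive reviews predicted as positive); the counts are a reconstruction, not output from the notebook, and the helper `prf` is a name introduced here.

```python
def prf(tp, fp, fn):
    """Return (precision, recall, F1) for one class from its counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts inferred from the classification report (an assumption, see above)
# Class 0 (negative): 4 found, 54 missed, none predicted negative by mistake
p0, r0, f0 = prf(tp=4, fp=0, fn=54)
# Class 1 (positive): all 572 found, plus the 54 missed negatives as false positives
p1, r1, f1 = prf(tp=572, fp=54, fn=0)

accuracy = (4 + 572) / 630

print("class 0:", round(p0, 2), round(r0, 2), round(f0, 2))
print("class 1:", round(p1, 2), round(r1, 2), round(f1, 2))
print("accuracy:", accuracy)
```

The rounded values agree with the table, which shows that the high precision for class 0 rests on only four predictions.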

## Conclusion
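The Logistic Regression model predicted positive reviews (class 1) well but largely failed on negative reviews (class 0): with a recall of 0.07, it correctly identified only 4 of the 58 negative reviews in the test set. The likely cause is class imbalance. Negative reviews make up only about 9% of the test set, and a model trained on such data tends to favor the majority class. The overall accuracy of 91.43% is therefore misleading: a model that labeled every review as positive would reach 90.79% (572 of 630), so the accuracy mostly reflects the class distribution rather than the model's ability to separate the two classes.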
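
To put the accuracy figure in context, the following sketch compares it with a majority-class baseline and computes balanced accuracy (the mean of the per-class recalls). It uses only the support counts and accuracy from the classification report; the number of correctly identified negative reviews (4) is inferred from the reported recall.

```python
support_neg, support_pos = 58, 572      # test-set support per class
total = support_neg + support_pos

model_acc = 0.9142857142857143          # accuracy reported by the notebook
baseline_acc = support_pos / total      # always predict "positive"

# Balanced accuracy: average recall over both classes
recall_neg = 4 / support_neg            # 4 correct negatives, inferred from recall 0.07
recall_pos = 1.0
balanced_acc = (recall_neg + recall_pos) / 2

print("baseline accuracy:", round(baseline_acc, 4))
print("model accuracy:   ", round(model_acc, 4))
print("balanced accuracy:", round(balanced_acc, 4))
```

The model beats the baseline by only four reviews out of 630, and the balanced accuracy sits close to 0.5, which is what a classifier that ignores the minority class would score.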

## Future Improvements
To improve performance, especially on class 0 (negative reviews), we could:
- Use **class weighting** in the Logistic Regression model to give more importance to the minority class.
- Apply **over-sampling** techniques like SMOTE to increase the number of negative reviews in the training set.
- Experiment with other machine learning models such as Random Forest or Support Vector Machines.
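
As a rough sketch of the first suggestion, the example below compares an unweighted model with `class_weight='balanced'` on synthetic data. The data comes from `make_classification` and only mimics the roughly 10/90 class split; it is not the Alexa dataset, and the stratified split is an extra assumption that keeps the class ratio the same in both sets.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the TF-IDF features: about 10% class 0, 90% class 1
X_demo, y_demo = make_classification(
    n_samples=3000, n_features=20, n_informative=5,
    weights=[0.1, 0.9], class_sep=0.8, random_state=42)

# stratify keeps the class ratio equal in the training and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42)

recalls = {}
for label, weight in [("unweighted", None), ("balanced", "balanced")]:
    model = LogisticRegression(class_weight=weight, max_iter=1000)
    model.fit(X_tr, y_tr)
    # Recall on the minority class (class 0), the weak spot in the report
    recalls[label] = recall_score(y_te, model.predict(X_te), pos_label=0)
    print(label, "class-0 recall:", round(recalls[label], 2))
```

Class weighting usually raises minority-class recall at the cost of some precision, so both metrics should be checked when applying it to the real reviews.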