
In [15]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

In [3]:
# Download dataset
!gdown --id 1S6qMioqPJjyBLpLVz4gmRTnJHnjitnuV
!gdown --id 1zdmewp7ayS4js4VtrJEHzAheSW-5NBZv

# Load dataset
data_df = pd.read_csv("reviews.csv")

Downloading...
From: https://drive.google.com/uc?id=1S6qMioqPJjyBLpLVz4gmRTnJHnjitnuV
To: /content/apps.csv
100% 134k/134k [00:00<00:00, 85.4MB/s]
Downloading...
From: https://drive.google.com/uc?id=1zdmewp7ayS4js4VtrJEHzAheSW-5NBZv
To: /content/reviews.csv
100% 7.17M/7.17M [00:00<00:00, 165MB/s]


In [7]:
# Basic filtering to focus the problem on clearly positive vs. negative reviews
data_df = data_df[data_df["score"] != 3].copy()  # Drop neutral reviews; .copy() avoids SettingWithCopyWarning
data_df["sentiment"] = data_df["score"] > 3  # True when the score is above 3, False otherwise
data_df = data_df[["content", "sentiment"]]
data_df = data_df.sample(frac=0.5)  # Keep a random half of the rows


1. Examine a few text samples. What can be done to clean this text given your observations?

In [8]:
print(data_df.shape)
print(data_df.columns)
print(data_df.head())
print(data_df.tail())

(5352, 2)
Index(['content', 'sentiment'], dtype='object')
                                                 content  sentiment
2586   Why tick tick for windows is available for 15 ...      False
11940                               Just the app I need!       True
351    Too much spam. Stuff that I don't care for pop...      False
2276   Very customizable planner, easy to navigate, b...       True
14438  Very easy to use. Does the job well. Has all t...       True
                                                 content  sentiment
13328  Especially like that there is a widget option ...       True
1554   New layout and functionality is no bueno. Don'...      False
15650                                 Great for stalking       True
10387  It took me a while to understand how to use ce...       True
8312   Really simple and good application. Good job guys       True
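
The table view above truncates most reviews, so it can help to print a few of them in full and to count which non-letter characters occur. A minimal sketch, using a small made-up DataFrame in place of `data_df` (the sample texts and the `non_letter_counts` helper are illustrative assumptions, not part of the dataset):

```python
import re
from collections import Counter

import pandas as pd

# Hypothetical stand-in for data_df; the notebook would use its own rows.
sample_df = pd.DataFrame({
    "content": [
        "Just the app I need!",
        "Too much spam... Don't like it :(",
        "GREAT app 10/10",
    ],
    "sentiment": [True, False, True],
})

# Print whole reviews instead of the truncated table view.
for text in sample_df["content"].head(3):
    print(repr(text))


def non_letter_counts(texts):
    """Count every character that is neither an ASCII letter nor whitespace."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[^a-zA-Z\s]", text))
    return counts


char_counts = non_letter_counts(sample_df["content"])
print(char_counts.most_common(5))
```

Characters that show up often here (punctuation, digits, emoticons) are candidates for removal or special handling in the cleaning step.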


2. Implement a simple text cleaning script

In [9]:
# Chain the cleaning steps so that each one builds on the previous result
data_df['cleaned_content'] = (
    data_df['content']
    .str.lower()                                     # normalize case
    .str.strip()                                     # trim leading/trailing whitespace
    .str.replace(r'[^a-zA-Z\s]', '', regex=True)     # keep only letters and whitespace
)

print(data_df.head())
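
The cleaning steps (lowercasing, trimming, removing non-letter characters) can also be packaged as a plain function, which makes them easy to try on single strings before applying them to the whole column. A sketch, where `clean_text` is a hypothetical helper name and dropping digits is an assumption about what the classifier needs:

```python
import re


def clean_text(text: str) -> str:
    """Lowercase, keep only letters and whitespace, and collapse runs of whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)   # drop punctuation, digits, symbols
    text = re.sub(r"\s+", " ", text)       # collapse repeated whitespace
    return text.strip()


cleaned = clean_text("  Too much SPAM. Stuff that I don't care for!  ")
print(cleaned)
```

On the DataFrame this could be applied with `data_df['content'].apply(clean_text)`.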


3. Create a train-test data split

In [10]:
x = data_df['cleaned_content']
y = data_df['sentiment']

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
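
If the positive and negative classes turn out to be unbalanced, passing `stratify=y` keeps the class ratio similar in both splits. A sketch on toy data (the toy lists below are assumptions for illustration only):

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy stand-ins for the text column and the sentiment labels.
toy_texts = [f"review {i}" for i in range(20)]
toy_labels = [True] * 15 + [False] * 5

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_texts, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels
)

# Both splits keep the 3:1 ratio of positive to negative labels.
train_counts = Counter(y_tr)
test_counts = Counter(y_te)
print(train_counts, test_counts)
```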



4. Create a term-document matrix and implement a filter to reduce the number of terms

In [16]:
documents = data_df['cleaned_content'].tolist()

# Tokenize each document into words; iterating over a string directly would yield characters
tokenized_docs = [doc.split() for doc in documents]

# Build the vocabulary from every word seen in the corpus
vocabulary = set()
for tokens in tokenized_docs:
    vocabulary.update(tokens)

vocabulary = sorted(vocabulary)

# Initialize the term-document matrix as a 2D NumPy array (rows: terms, columns: documents)
tdm_matrix = np.zeros((len(vocabulary), len(documents)), dtype=np.int32)

# Create a mapping from each word to its row index in the matrix
word_index = {word: i for i, word in enumerate(vocabulary)}

# Populate the matrix with per-document word counts
for doc_index, tokens in enumerate(tokenized_docs):
    for word in tokens:
        tdm_matrix[word_index[word], doc_index] += 1

# Convert the NumPy matrix to a DataFrame for easier viewing
tdm_df = pd.DataFrame(tdm_matrix, index=vocabulary, columns=[f'Doc_{i+1}' for i in range(len(documents))])

# Show the first few rows of the term-document matrix
tdm_df.head()
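
The exercise also asks for a filter to reduce the number of terms. One common option is a document-frequency threshold: keep only the terms that appear in at least `min_df` documents. A sketch on a tiny hand-made corpus; the helper name `filter_terms_by_doc_freq` and the value `min_df=2` are illustrative assumptions:

```python
import numpy as np


def filter_terms_by_doc_freq(tdm, terms, min_df=2):
    """Keep rows (terms) that occur in at least min_df documents (columns)."""
    doc_freq = (tdm > 0).sum(axis=1)  # number of documents containing each term
    keep = doc_freq >= min_df
    kept_terms = [term for term, flag in zip(terms, keep) if flag]
    return tdm[keep], kept_terms


# Toy corpus: rows of the matrix are terms, columns are documents.
toy_docs = ["good app", "bad app", "good good widget"]
toy_terms = sorted({word for doc in toy_docs for word in doc.split()})
toy_tdm = np.array([[doc.split().count(term) for doc in toy_docs] for term in toy_terms])

filtered_tdm, kept_terms = filter_terms_by_doc_freq(toy_tdm, toy_terms, min_df=2)
print(kept_terms)
print(filtered_tdm)
```

scikit-learn's `CountVectorizer`, which the notebook already imports, offers the same kind of filtering through its `min_df`, `max_df`, and `max_features` parameters.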


5. Train a classifier and get the train + test accuracy
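
One possible approach, sketched here on toy data, is a bag-of-words pipeline with logistic regression. The toy reviews, the choice of `LogisticRegression`, and `min_df=1` are all assumptions for illustration; the pipeline fits the vectorizer on the training split only, so no test vocabulary leaks into training.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy stand-ins for cleaned reviews and their sentiment labels.
toy_texts = [
    "great app love it", "really good and simple", "love this planner",
    "very easy to use", "good job guys", "great widget option",
    "too much spam", "terrible new layout", "bad update hate it",
    "app keeps crashing", "worst app ever", "hate the ads",
]
toy_labels = [True] * 6 + [False] * 6

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels
)

# Vectorizer and classifier are fitted together on the training data only.
model = make_pipeline(CountVectorizer(min_df=1), LogisticRegression(max_iter=1000))
model.fit(X_tr, y_tr)

train_acc = accuracy_score(y_tr, model.predict(X_tr))
test_acc = accuracy_score(y_te, model.predict(X_te))
print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
```

In the notebook, `X_train`, `X_test`, `y_train`, and `y_test` from step 3 would take the place of the toy split.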

6. What are some other attributes you could create to improve classification?
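
A few candidate attributes can be computed directly from the raw review text, before cleaning removes punctuation and case. The sketch below is only one possibility; the feature names, the `NEGATIONS` word list, and the `extra_features` helper are hypothetical:

```python
# Small, hand-picked list of negation words (an assumption, not a standard resource).
NEGATIONS = {"not", "no", "never", "dont", "don't", "cant", "can't"}


def extra_features(text: str) -> dict:
    """Compute simple hand-crafted attributes from one raw review."""
    words = text.split()
    return {
        "n_chars": len(text),
        "n_words": len(words),
        "n_exclamations": text.count("!"),
        "upper_ratio": sum(ch.isupper() for ch in text) / max(len(text), 1),
        "has_negation": any(w.lower().strip(".,!?") in NEGATIONS for w in words),
    }


features = extra_features("Not good!!")
print(features)
```

Other options include word n-grams (for example `CountVectorizer(ngram_range=(1, 2))`) and TF-IDF weighting via scikit-learn's `TfidfVectorizer`.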