
In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pandas as pd

file_path = '/content/drive/MyDrive/Classroom/flipkart_product.csv'

# latin-1 maps every possible byte to a character, so this read cannot raise
# UnicodeDecodeError and no fallback encoding is needed. The trade-off is
# mojibake wherever the file is not actually latin-1 (see the Price column).
df = pd.read_csv(file_path, encoding='latin-1')
print("Loaded with latin-1 encoding:")
print(df.head())

Loaded with latin-1 encoding:
                                         ProductName     Price Rate  \
0  Candes 12 L Room/Personal Air Cooler?ÿ?ÿ(White...  ??3,999    5   
1  Candes 12 L Room/Personal Air Cooler?ÿ?ÿ(White...  ??3,999    5   
2  Candes 12 L Room/Personal Air Cooler?ÿ?ÿ(White...  ??3,999    3   
3  Candes 12 L Room/Personal Air Cooler?ÿ?ÿ(White...  ??3,999    1   
4  Candes 12 L Room/Personal Air Cooler?ÿ?ÿ(White...  ??3,999    3   

            Review                                            Summary  
0           Super!  Great cooler.. excellent air flow and for this...  
1          Awesome             Best budget 2 fit cooler. Nice cooling  
2             Fair  The quality is good but the power of air is de...  
3  Useless product                 Very bad product it's a only a fan  
4             Fair                                      Ok ok product  
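
The `â¹` and `??` seen in the Price column are consistent with a rupee sign (₹) that was stored as UTF-8 and then read as latin-1. The sketch below reproduces that round trip with the standard library only. The `repair` helper is a hypothetical name introduced for illustration, and it can only recover text whose original bytes survived; characters already replaced by `?` in the file stay as they are.

```python
# The rupee sign occupies three bytes in UTF-8.
rupee_bytes = "₹".encode("utf-8")          # b'\xe2\x82\xb9'

# Decoding those bytes as latin-1 never fails; it yields three wrong characters.
mojibake = rupee_bytes.decode("latin-1")   # 'â', an invisible control char, '¹'


def repair(text: str) -> str:
    """Undo a UTF-8-read-as-latin-1 mistake when the bytes are still intact."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not clean mojibake (for example a stray 0xFF byte): leave it untouched.
        return text


repaired_price = repair(mojibake + "86,990")
unchanged_price = repair("??3,999")
unchanged_name = repair("Air Cooler?ÿ?ÿ(White")
print(repaired_price, unchanged_price, unchanged_name)
```

Applying such a helper column by column (for example with `df['Price'].map(repair)`) is one option; re-exporting the CSV with a known encoding is usually the cleaner fix.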


In [None]:
# Get basic information about the DataFrame
print("DataFrame Info:")
df.info()

# Print a random sample of rows to see content beyond the head
print("\nSample of Data:")
print(df.sample(10))

DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 189874 entries, 0 to 189873
Data columns (total 5 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   ProductName  189874 non-null  object
 1   Price        189873 non-null  object
 2   Rate         189873 non-null  object
 3   Review       189870 non-null  object
 4   Summary      189860 non-null  object
dtypes: object(5)
memory usage: 7.2+ MB

Sample of Data:
                                              ProductName      Price Rate  \
80562   LUMINOUS ZELIO+ 1100/12V (E-comm) Pure Sine Wa...   ??7,299    3   
96779   APPLE 2020 Macbook Air M1 - (8 GB/256 GB SSD/M...  â¹86,990    5   
133508         POCO C31 (Royal Blue, 64 GB)?ÿ?ÿ(4 GB RAM)   ??7,499    5   
60833   Google Nest Hub (2nd gen), Display with Google...   ??6,999    5   
32184   Midea TORRINO,WQP12-5201F Free Standing 13 Pla...  ??25,990    4   
167492  cello Pack of 18 Opalware Cello Dazzle Lush Fi...   â
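
`df.info()` reports `Price` and `Rate` as `object` columns, so they cannot be averaged or compared numerically as loaded. A minimal sketch of one way to convert them, run here on a small made-up frame rather than the real data; the column names `price_inr` and `rating` are assumptions introduced for illustration.

```python
import pandas as pd

# Toy rows that imitate the formats visible in the output above.
sample = pd.DataFrame({
    "Price": ["??3,999", "â¹86,990", None],
    "Rate": ["5", "3", "not a rating"],
})

# Keep only ASCII digits and the decimal point, then parse.
# errors="coerce" turns anything unparseable into NaN instead of raising.
sample["price_inr"] = pd.to_numeric(
    sample["Price"].str.replace(r"[^0-9.]", "", regex=True),
    errors="coerce",
)
sample["rating"] = pd.to_numeric(sample["Rate"], errors="coerce")

print(sample[["price_inr", "rating"]])
```

Counting the NaN values afterwards (`sample['rating'].isna().sum()`) shows how many entries failed to parse and may need a closer look.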

In [None]:
# Check for missing values in the 'Review' column
print("Missing values in 'Review' before handling:")
print(df['Review'].isnull().sum())

# Drop rows with missing values in the 'Review' column
df.dropna(subset=['Review'], inplace=True)

# Verify missing values after handling
print("\nMissing values in 'Review' after handling:")
print(df['Review'].isnull().sum())

# Convert 'Review' text to lowercase
df['Review'] = df['Review'].str.lower()

# Display the first few rows of the cleaned 'Review' column
print("\nFirst few cleaned reviews:")
print(df['Review'].head())

Missing values in 'Review' before handling:
4

Missing values in 'Review' after handling:
0

First few cleaned reviews:
0             super!
1            awesome
2               fair
3    useless product
4               fair
Name: Review, dtype: object


In [None]:
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Download VADER lexicon if you haven't already
nltk.download('vader_lexicon')

# Initialize the VADER sentiment analyzer
analyzer = SentimentIntensityAnalyzer()

# Function to get the negative sentiment score
def get_negative_sentiment_score(text):
    vs = analyzer.polarity_scores(text)
    return vs['neg']

# Apply the function to the 'Review' column to get the negative sentiment score
df['negative_sentiment_score'] = df['Review'].apply(get_negative_sentiment_score)

# Display the first few rows with the negative sentiment scores
print(df[['Review', 'negative_sentiment_score']].head())

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


            Review  negative_sentiment_score
0           super!                     0.000
1          awesome                     0.000
2             fair                     0.000
3  useless product                     0.737
4             fair                     0.000
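
Many reviews in this dataset are identical short phrases ("fair", "useless product"), so scoring each distinct text once and mapping the result back can save a lot of repeated work. The sketch below shows the pattern with a hypothetical stand-in scorer, `score_text`, in place of `analyzer.polarity_scores(text)['neg']`, so that it runs without the VADER lexicon.

```python
import pandas as pd

calls = 0


def score_text(text: str) -> float:
    """Hypothetical stand-in for analyzer.polarity_scores(text)['neg']."""
    global calls
    calls += 1
    return 1.0 if "useless" in text else 0.0


reviews = pd.Series(["super!", "awesome", "fair", "useless product", "fair", "awesome"])

# Score each distinct review once, then broadcast the scores back to every row.
score_by_text = {text: score_text(text) for text in reviews.unique()}
scores = reviews.map(score_by_text)

print(f"{calls} scorer calls for {len(reviews)} rows")
print(scores.tolist())
```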


In [None]:
# Define a threshold for considering a review as highly negative
negative_threshold = 0.5

# Filter the DataFrame to get reviews with a negative sentiment score above the threshold
highly_negative_reviews_df = df[df['negative_sentiment_score'] > negative_threshold]

# Display the number of highly negative reviews
print(f"Number of highly negative reviews (threshold > {negative_threshold}): {len(highly_negative_reviews_df)}")

# Display a sample of the highly negative reviews and their scores
print("\nSample of highly negative reviews:")
print(highly_negative_reviews_df[['Review', 'negative_sentiment_score']].head(10))

Number of highly negative reviews (threshold > 0.5): 13442

Sample of highly negative reviews:
                     Review  negative_sentiment_score
3           useless product                     0.737
48          waste of money!                     0.607
59                worthless                     1.000
67   worst experience ever!                     0.687
73          useless product                     0.737
80                 not good                     0.706
84    slightly disappointed                     0.737
87                 horrible                     1.000
99                very poor                     0.772
104         useless product                     0.737
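
VADER's `neg` value is the share of the text judged negative, which is why one-word reviews such as "worthless" reach 1.000. The count of "highly negative" reviews therefore depends heavily on the cutoff. A small sketch of a threshold sweep, using scores copied from the output above; the candidate thresholds are arbitrary choices for illustration.

```python
import pandas as pd

# Scores taken from the printed samples above (first rows of the dataset).
neg_scores = pd.Series([0.0, 0.0, 0.0, 0.737, 0.607, 1.0, 0.687, 0.706, 0.772])

# Count how many reviews survive each candidate threshold.
counts_by_threshold = {
    threshold: int((neg_scores > threshold).sum())
    for threshold in (0.5, 0.7, 0.9)
}

for threshold, count in counts_by_threshold.items():
    print(f"threshold > {threshold}: {count} reviews")
```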


In [None]:
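# Keywords related to price.
# Matching is a plain substring test, so 'cheap (negative context)' and
# 'money waste' only match those literal strings; "waste of money" is not caught.
price_keywords = ['price', 'cost', 'expensive', 'overpriced', 'cheap (negative context)', 'money waste']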

# Keywords related to lack of features
feature_keywords = ['feature missing', 'no feature', 'lacks', 'not have', 'without', 'basic']

# Keywords related to poor service
service_keywords = ['service bad', 'poor service', 'customer care worst', 'support bad', 'delivery late', 'damaged', 'broken']

# Return True if any keyword occurs as a substring of the review
def contains_keyword(text, keywords):
    return isinstance(text, str) and any(keyword in text for keyword in keywords)

# Filter for negative reviews mentioning price-related keywords
negative_price_reviews_df = highly_negative_reviews_df[highly_negative_reviews_df['Review'].apply(lambda x: contains_keyword(x, price_keywords))]
print(f"\nNumber of highly negative reviews mentioning price: {len(negative_price_reviews_df)}")
print(negative_price_reviews_df[['Review', 'negative_sentiment_score']].head())

# Filter for negative reviews mentioning feature-related keywords
negative_feature_reviews_df = highly_negative_reviews_df[highly_negative_reviews_df['Review'].apply(lambda x: contains_keyword(x, feature_keywords))]
print(f"\nNumber of highly negative reviews mentioning lack of features: {len(negative_feature_reviews_df)}")
print(negative_feature_reviews_df[['Review', 'negative_sentiment_score']].head())

# Filter for negative reviews mentioning service-related keywords
negative_service_reviews_df = highly_negative_reviews_df[highly_negative_reviews_df['Review'].apply(lambda x: contains_keyword(x, service_keywords))]
print(f"\nNumber of highly negative reviews mentioning poor service: {len(negative_service_reviews_df)}")
print(negative_service_reviews_df[['Review', 'negative_sentiment_score']].head())


Number of highly negative reviews mentioning price: 0
Empty DataFrame
Columns: [Review, negative_sentiment_score]
Index: []

Number of highly negative reviews mentioning lack of features: 1
                                         Review  negative_sentiment_score
181496  worst and useless without installation                      0.697

Number of highly negative reviews mentioning poor service: 4
                         Review  negative_sentiment_score
1015                    damaged                     1.000
17775           damaged product                     0.744
53841           damaged product                     0.744
128449  came with broken cover.                     0.508
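
The price filter returns no rows even though "waste of money!" appears among the highly negative reviews, because plain substring matching needs the exact phrase. A minimal sketch of a regex-based alternative follows; the pattern list and the name `mentions_price` are illustrative assumptions, not a validated lexicon.

```python
import re

# Illustrative patterns only; they should be tuned against the real reviews.
PRICE_PATTERNS = [
    r"\bprice\w*",          # price, prices, priced
    r"\bcost\w*",           # cost, costly, costs
    r"\bexpensive\b",
    r"\boverpriced?\b",     # overprice, overpriced
    r"\bwaste of money\b",
    r"\bmoney waste\b",
]
price_regex = re.compile("|".join(PRICE_PATTERNS))


def mentions_price(text) -> bool:
    """True if the (already lowercased) review matches any price pattern."""
    return isinstance(text, str) and price_regex.search(text) is not None


samples = ["waste of money!", "useless product", "overpriced", "worthless"]
flags = [mentions_price(s) for s in samples]
print(flags)
```

The function can be passed to `.apply` in place of the `contains_keyword` lambda, for example `highly_negative_reviews_df['Review'].apply(mentions_price)`.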


In [None]:
import nltk
import spacy
from nltk.corpus import stopwords

# Download stopwords if you haven't already
nltk.download('stopwords')

# Load the English language model for spaCy
nlp = spacy.load("en_core_web_sm", disable=["ner"]) # Disable Named Entity Recognition for speed

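# NLTK's English stop words as a set. Kept for reference only: the filter in
# preprocess_text relies on spaCy's token.is_stop, not on this set.
stop_words = set(stopwords.words('english'))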

def preprocess_text(text):
    if isinstance(text, str):
        # Tokenize the text using spaCy
        doc = nlp(text)
        # Lemmatize and remove stop words and punctuation
        tokens = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct and not token.is_space]
        return " ".join(tokens)
    return ""

# Apply the preprocessing function to the 'Review' column
df['processed_review'] = df['Review'].apply(preprocess_text)

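# Remove rows where the processed review is empty. preprocess_text returns ""
# rather than NaN, so a length filter is enough; .copy() makes the result an
# independent DataFrame, so later column assignments do not trigger
# SettingWithCopyWarning.
df = df[df['processed_review'].str.len() > 0].copy()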

# Display the first few processed reviews
print(df[['Review', 'processed_review']].head())

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


            Review processed_review
0           super!            super
1          awesome          awesome
2             fair             fair
3  useless product  useless product
4             fair             fair


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')

# Fit and transform the processed reviews
tfidf_matrix = tfidf_vectorizer.fit_transform(df['processed_review'])

# Get the feature names (words)
feature_names = tfidf_vectorizer.get_feature_names_out()

# Print the shape of the TF-IDF matrix and the number of features
print(f"Shape of TF-IDF matrix: {tfidf_matrix.shape}")
print(f"Number of unique features (words): {len(feature_names)}")

Shape of TF-IDF matrix: (189857, 415)
Number of unique features (words): 415


In [None]:
from sklearn.decomposition import LatentDirichletAllocation

# Define the number of topics you want to extract
num_topics = 10  # You can experiment with different numbers of topics

# Initialize the LDA model
lda_model = LatentDirichletAllocation(n_components=num_topics, random_state=42, n_jobs=-1)

# Fit the LDA model to the TF-IDF matrix
lda_model.fit(tfidf_matrix)

# Function to display the top words for each topic
def display_topics(model, feature_names, num_top_words=10):
    for topic_idx, topic in enumerate(model.components_):
        print(f"Topic {topic_idx}:")
        top_words_indices = topic.argsort()[:-num_top_words - 1:-1]
        top_words = [feature_names[i] for i in top_words_indices]
        print(" ".join(top_words))

# Display the discovered topics and their top words
print("\nDiscovered Topics:")
display_topics(lda_model, feature_names)


Discovered Topics:
Topic 0:
nice market okay product hate disappointed slightly performance month family
Topic 1:
simply awesome job waste bad money poor unsatisfactory experience quality
Topic 2:
awesome excellent recommend highly moderate meet expectation price range microwave
Topic 3:
good wow choice pretty quality product disappoint utterly work ball
Topic 4:
delightful worthless horrible juicer tv review need kitchen overprice laptop
Topic 5:
terrific wonderful product classy perfect purchase mixer grinder cool like
Topic 6:
product great penny worth decent terrible rubbish absolute expect star
Topic 7:
money blow mind purchase value worth fair superb budget inverter
Topic 8:
nan useless way product love fantastic perform slot honest face
Topic 9:
brilliant buy fabulous super camera inalsa box hand table link


In [None]:
import numpy as np

# Get the topic distribution for each document
topic_distributions = lda_model.transform(tfidf_matrix)

# Dominant topic = index of the largest probability in each review's distribution
df['dominant_topic'] = np.argmax(topic_distributions, axis=1)

# Filter the DataFrame for reviews primarily associated with Topic 2 (index 2)
topic_2_reviews_df = df[df['dominant_topic'] == 2]
print("\nSample Reviews Primarily Associated with Topic 2 (Price/Alternatives):")
print(topic_2_reviews_df['Review'].head(10).values)

# Filter the DataFrame for reviews primarily associated with Topic 1 (index 1)
topic_1_reviews_df = df[df['dominant_topic'] == 1]
print("\nSample Reviews Primarily Associated with Topic 1 (Pain Points/Quality):")
print(topic_1_reviews_df['Review'].head(10).values)


Sample Reviews Primarily Associated with Topic 2 (Price/Alternatives):
['awesome' 'awesome' 'highly recommended' 'highly recommended' 'awesome'
 'awesome' 'excellent' 'awesome' 'excellent' 'awesome']

Sample Reviews Primarily Associated with Topic 1 (Pain Points/Quality):
['unsatisfactory' 'simply awesome' 'waste of money!' 'does the job'
 'worst experience ever!' 'simply awesome' 'simply awesome'
 'simply awesome' 'very poor' 'simply awesome']
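
The Topic 1 sample mixes "simply awesome" with "waste of money!", so the dominant topic alone says little about how strongly a review belongs to it. A short sketch that also keeps the winning probability and flags weak assignments; the array values and the 0.5 cutoff are assumptions chosen for illustration.

```python
import numpy as np

# Toy document-topic matrix: each row is one review's topic distribution.
topic_distributions = np.array([
    [0.70, 0.20, 0.10],
    [0.34, 0.33, 0.33],
    [0.10, 0.10, 0.80],
])

dominant_topic = topic_distributions.argmax(axis=1)   # winning topic per row
topic_strength = topic_distributions.max(axis=1)      # its probability
is_confident = topic_strength >= 0.5                   # hypothetical cutoff

print(dominant_topic, topic_strength, is_confident)
```

Filtering on a flag like `is_confident` before sampling reviews per topic tends to give examples that are more representative of each topic.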




In [None]:
!find /content/drive -name 'flipkart_product.csv'

/content/drive/MyDrive/Classroom/flipkart_product.csv
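
The same lookup can be done from Python, which avoids hard-coding the path in the loading cell. A minimal sketch using `pathlib`; `find_file` is a hypothetical helper, and the temporary directory only stands in for `/content/drive` so the example runs anywhere.

```python
import tempfile
from pathlib import Path


def find_file(root, name):
    """Return every path under root whose file name equals name."""
    return sorted(Path(root).rglob(name))


# Build a throwaway directory tree that mimics the Drive layout.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "MyDrive" / "Classroom" / "flipkart_product.csv"
    target.parent.mkdir(parents=True)
    target.write_text("ProductName,Price\n")

    matches = find_file(tmp, "flipkart_product.csv")
    relative_matches = [p.relative_to(tmp).as_posix() for p in matches]

print(relative_matches)
```

In the notebook itself the call would be `find_file('/content/drive', 'flipkart_product.csv')`.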
