# Sentiment Analysis of Tweets on Apple and Google Products

![](NLP_Image.jpg)

##  1. BUSINESS OVERVIEW  

###  1.1 **Business Understanding**

#### 1.1.1 **What is  sentiment analysis?**

**Sentiment analysis**, also known as **sentiment classification**, uses natural language processing to identify the emotional tone behind text, such as customer feedback, and categorize it as positive, negative, or neutral.

It can also be described as a text classification task: given a phrase or a list of phrases, a classifier predicts whether the sentiment behind it is:
- positive
- negative 
- neutral. 

In some cases, the neutral class is dropped to keep the task a binary classification problem.

In this project we carry out a sentiment classification task: we analyze tweets, the emotions they express, and whether those emotions are directed at a brand or product, specifically Apple and Google products.

#### 1.1.2 **How is sentiment analysis important to an organization?**

For your organization, sentiment analysis is crucial in the following ways:

- It helps you understand customer opinions, improve experiences, and address concerns proactively.
- It helps to predict customer behavior for a particular product.
- It can help to test the adaptability of a product.
- It automates the production of customer preference reports.

These are only a few of the benefits. In general, sentiment analysis also assists business stakeholders in defining various business problems regarding their products.

#### 1.1.3 **An Overview of Apple Products**

Apple is one of the most recognized brands in the world, valued at over $2 trillion in 2021. It is known for its innovative consumer electronics, including the iPhone, iPad, MacBook, and other devices. Apple’s next products may include a virtual reality headset and a self-driving car. The past few product launches, like the HomePod and AirPods, have been smaller in scope, and Apple fans are clamouring for the next iPhone. Given this wide range of products, the following statistics summarize Apple's performance in 2023, as of September:

- 231 million iPhones, 49 million iPads and 22 million Mac and MacBook units were sold in 2023
- Apple’s home and wearables division declined by 6.5% in 2023
- It sold 75 million AirPods and 38 million Apple Watches in 2023
- Apple Music has 93 million subscribers, Apple TV+ has 47 million

To sustain high revenues, Apple needs to continuously analyze sentiment around its brand, which evokes strong emotional reactions and frequently produces a mix of positive and negative sentiment in the data. Tweets are a helpful source of such data: tweets about Apple products can cover, among other things, customer service experiences, software upgrades, or the introduction of new items.


#### 1.1.4 **An Overview of Google Products**

Google offers diverse products designed to enhance productivity, connectivity, and innovation. Key offerings include:
- Google Search
- Gmail
- Google Drive
- Google Workspace for organizing and collaborating
- YouTube
- Google Photos
- Google Play for entertainment
- Google Maps, Waze, and Google Earth for navigation
- Google Ads, Google Analytics, and Google Cloud Platform for businesses, plus developer tools like Firebase and BigQuery

Additionally, smart devices like Pixel phones, Nest home products, and Chromecast provide cutting-edge hardware solutions. Overall, Google's products aim to simplify daily life, empower businesses, and connect the world. Statistics on Google's performance showed revenue of $278.1 billion in 2021. As with Apple, sentiment analysis is crucial for Google to keep track of matters such as customer satisfaction and continue to thrive.

#### 1.1.5 **Why Analyze Tweets?**
- **Social Media Influence**: Platforms like Twitter have become primary channels where customers share their feedback, both positive and negative, about brands and products.
- **Volume of Data**: The massive and real-time nature of tweets makes manual analysis impractical, necessitating automated solutions.
- **Business Impact**: Sentiment analysis of tweets can provide actionable insights to enhance customer experience, refine marketing strategies, and maintain a competitive edge.


#### 1.1.6   Stakeholders 

Sentiment analysis is important for various participants such as:

- **Business Managers**: Understand customer satisfaction and drive decision-making.
- **Marketing Teams**: Create sentiment-driven marketing campaigns.
- **Customer Service Teams**: Prioritize resolving issues flagged in negative reviews.

#### 1.1.7 **Challenges in Sentiment Analysis**
- **Unstructured Data**: Tweets are often informal, with abbreviations, slang, and emojis, making preprocessing essential.
- **Ambiguity**: Some texts may have mixed sentiments or implicit emotions that are challenging to classify.
- **Scalability**: Handling and processing large datasets efficiently is a significant challenge.

#### 1.1.8 **Proposed Solutions**
##### Approach Methodology

##### 1. Execution plan for the sentiment analysis:
- Begin with simple approaches like bag-of-words or TF-IDF vectorization.
- Proceed to more complex methods (e.g., word embeddings or transformers).

##### 2. Pre-trained Tools: 
* NLP libraries such as spaCy, NLTK, and Hugging Face Transformers provide pre-trained models and tools for quick text processing.

For example, use:
- TF-IDF + Logistic Regression for a baseline.
- Pre-trained embeddings (e.g., Word2Vec, GloVe) for better results.
- Fine-tuned BERT if there is access to good hardware.
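
The baseline named above (TF-IDF features fed into logistic regression) might be wired together roughly as in the sketch below. The six toy tweets and their labels are invented purely for illustration and are not drawn from the project dataset, and the settings (`ngram_range`, `max_iter`) are assumptions rather than tuned values.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy examples invented for illustration only (not from the project data).
texts = [
    "love my new ipad",
    "the iphone battery died again",
    "google maps is great",
    "android app keeps crashing",
    "heading to the apple store today",
    "google is hosting a party tonight",
]
labels = ["positive", "negative", "positive", "negative", "neutral", "neutral"]

# TF-IDF turns each tweet into a sparse vector of weighted term counts;
# logistic regression then learns one set of weights per sentiment class.
baseline = make_pipeline(
    TfidfVectorizer(lowercase=True, ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
baseline.fit(texts, labels)

predictions = baseline.predict(["my ipad is great", "the app keeps crashing"])
print(list(predictions))
```

Wrapping both steps in one pipeline keeps the vectorizer fitted only on the training text, which avoids leaking vocabulary from a held-out set once a real train/test split is used.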

#### 1.1.9 Projected Conclusion

###  1.2 **Problem statement**

#### 1.2.1  Business Problem:
- In today’s digital world, customer feedback plays a critical role in shaping business decisions. Companies receive large volumes of unstructured textual data in the form of reviews, surveys, and social media posts. Analyzing this data manually is time-consuming and error-prone.

The goal of this project is to build a sentiment analysis model that classifies customer feedback as positive, negative, or neutral. 

This will enable businesses to:
- Identify key areas for improvement.
- Tailor marketing strategies based on customer sentiment.
- Monitor brand reputation over time.

###  1.3 **Objectives**

#### 1. **Primary Objective**:
   - Build a machine learning-based sentiment classification model that categorizes tweets as **positive**, **negative**, or **neutral** towards a brand or product.
   
#### 2. **Secondary Objectives**:
   - Identify whether a tweet contains an emotion directed at a specific brand or product.
   - Preprocess and clean the tweet text to remove noise (e.g., hashtags, mentions, and URLs).
   - Extract key textual features that indicate sentiment and brand-related emotions.
   - Provide actionable insights to help businesses improve customer satisfaction and marketing strategies.   
   
#### 1.3.1 **Key Questions to Address**
1. How can we preprocess and clean textual data effectively to extract meaningful insights?
2. What are the best features to use (e.g., word embeddings, TF-IDF, or sentiment lexicons) for classifying tweet sentiment?
3. Which supervised learning models (e.g., Logistic Regression, Random Forest, or BERT) perform best for this task?
4. What level of accuracy, precision, and recall can we achieve for sentiment classification?

###  1.4 **Metrics of Success**

To evaluate the performance of the sentiment classification model, we will use accuracy, precision, recall (sensitivity), F1-score, and the confusion matrix, with the following targets:

#### 1. **Accuracy**:
   - Accuracy is the percentage of tweets whose sentiment is classified correctly out of all tweets classified.
   - Target: **85% or higher**.

#### 2. **Precision**:
   - Precision is the percentage of tweets predicted to have a given sentiment that actually have it. It tells us how often the model is correct when it predicts, for example, a positive sentiment.
   - Target: **80% or higher** for each class (positive, negative, neutral).

#### 3. **Recall**:
   - Recall is the percentage of tweets that actually have a given sentiment that the model correctly identifies.
   - Target: **75% or higher** for each class.

#### 4. **F1-Score**:
   - The F1-score is the harmonic mean of precision and recall, balancing the trade-off between false positives and false negatives.
   - Target: **80% or higher** overall.
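
As a worked illustration of the four metrics above, the sketch below computes them by hand for a small set of made-up labels. The helper name `per_class_metrics` and the label lists are hypothetical. In practice scikit-learn's `classification_report` reports the same quantities.

```python
from collections import Counter

def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1) for one class, treating it as 'positive'."""
    pairs = Counter(zip(y_true, y_pred))
    tp = pairs[(label, label)]
    fp = sum(n for (t, p), n in pairs.items() if p == label and t != label)
    fn = sum(n for (t, p), n in pairs.items() if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels for illustration only.
y_true = ["pos", "pos", "pos", "neg", "neg", "neu", "neu", "neu", "neu", "neu"]
y_pred = ["pos", "pos", "neu", "neg", "neu", "neu", "neu", "neu", "pos", "neu"]

# Accuracy: share of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy = {accuracy:.2f}")  # accuracy = 0.70

for label in ("pos", "neg", "neu"):
    precision, recall, f1 = per_class_metrics(y_true, y_pred, label)
    print(f"{label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy example the negative class has perfect precision (1.00) but only 0.50 recall, which shows why the targets are set per class rather than on accuracy alone.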

#### 5. **Business Impact**:
   - Improved customer satisfaction through the identification of key negative sentiments.
   - Better marketing strategies based on trends in positive feedback.


### 2. DATA UNDERSTANDING
* Now we load the data and examine its shape, basic statistics, and variable types.
* We write a function that takes the loaded data and reports its shape, info, description, and missing values using df.shape, df.info(), df.describe(), and df.isnull().sum().
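
The project's `Project_Functions` module is not shown in this notebook, so the helper below is only a guess at what a summary function of this kind could look like. The name `summarize_frame`, its return value, and the two-row in-memory CSV are all hypothetical.

```python
import io

import pandas as pd

def summarize_frame(df):
    """Print the shape, dtypes, describe() table, and per-column missing counts."""
    print("Shape:", df.shape)
    print(df.dtypes)
    print(df.describe(include="all"))
    missing = df.isnull().sum()
    print(missing)
    return missing

# Tiny in-memory CSV standing in for the real file (illustrative rows only).
csv_text = (
    "tweet_text,label\n"
    "love my ipad,Positive emotion\n"
    ",No emotion toward brand or product\n"
)
demo = pd.read_csv(io.StringIO(csv_text))
missing_counts = summarize_frame(demo)
```

Returning the missing-value counts, rather than only printing them, makes it possible to reuse them later, for example to decide which columns need imputation.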

#### 2.1 Import Necessary Libraries 

In [1577]:
# Importing necessary libraries
import pandas as pd
import numpy as np
import re
import Project_Functions as Pf
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import words

# Download NLTK resources
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('words')


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\tracy\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\tracy\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\tracy\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\tracy\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!


True

#### 2.2 General  Dataset Exploration

In [1578]:
# Load and Display the first few rows of the dataset
df = pd.read_csv(r"C:\Users\tracy\Documents\Flatiron\Phase_4_project\Phase_4_Group_9_Project\judge-1377884607_tweet_product_company.csv", encoding='ISO-8859-1')
df 

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,".@wesley83 I have a 3G iPhone. After 3 hrs tweeting at #RISE_Austin, it was dead! I need to upgrade. Plugin stations at #SXSW.",iPhone,Negative emotion
1,"@jessedee Know about @fludapp ? Awesome iPad/iPhone app that you'll likely appreciate for its design. Also, they're giving free Ts at #SXSW",iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. They should sale them down at #SXSW.,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as crashy as this year's iPhone app. #sxsw,iPad or iPhone App,Negative emotion
4,"@sxtxstate great stuff on Fri #SXSW: Marissa Mayer (Google), Tim O'Reilly (tech books/conferences) &amp; Matt Mullenweg (Wordpress)",Google,Positive emotion
...,...,...,...
9088,Ipad everywhere. #SXSW {link},iPad,Positive emotion
9089,"Wave, buzz... RT @mention We interrupt your regularly scheduled #sxsw geek programming with big news {link} #google #circles",,No emotion toward brand or product
9090,"Google's Zeiger, a physician never reported potential AE. Yet FDA relies on physicians. &quot;We're operating w/out data.&quot; #sxsw #health2dev",,No emotion toward brand or product
9091,Some Verizon iPhone customers complained their time fell back an hour this weekend. Of course they were the New Yorkers who attended #SXSW.,,No emotion toward brand or product


In [1579]:
# Show the data information
Pf.check_Info(df)

(9093, 3)
Index(['tweet_text', 'emotion_in_tweet_is_directed_at',
       'is_there_an_emotion_directed_at_a_brand_or_product'],
      dtype='object')
tweet_text                                            object
emotion_in_tweet_is_directed_at                       object
is_there_an_emotion_directed_at_a_brand_or_product    object
dtype: object
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9093 entries, 0 to 9092
Data columns (total 3 columns):
 #   Column                                              Non-Null Count  Dtype 
---  ------                                              --------------  ----- 
 0   tweet_text                                          9092 non-null   object
 1   emotion_in_tweet_is_directed_at                     3291 non-null   object
 2   is_there_an_emotion_directed_at_a_brand_or_product  9093 non-null   object
dtypes: object(3)
memory usage: 213.2+ KB
None
tweet_text                                               1
emotion_in_tweet_is_directed_at          

In [1580]:
# Check for unique categories in emotion_in_tweet_is_directed_at column
df['emotion_in_tweet_is_directed_at'].unique()

array(['iPhone', 'iPad or iPhone App', 'iPad', 'Google', nan, 'Android',
       'Apple', 'Android App', 'Other Google product or service',
       'Other Apple product or service'], dtype=object)

In [1581]:
# Check for unique categories in Emotion-directed column
df['is_there_an_emotion_directed_at_a_brand_or_product'].unique()

array(['Negative emotion', 'Positive emotion',
       'No emotion toward brand or product', "I can't tell"], dtype=object)

* The dataset comprises 9093 rows and 3 columns: **tweet_text**, **emotion_in_tweet_is_directed_at**, and **is_there_an_emotion_directed_at_a_brand_or_product**.
* The tweet_text column contains the text of the tweet. The emotion_in_tweet_is_directed_at column names the Apple or Google product or service that the tweet's emotion is directed at. The last column records whether the tweet expresses a positive emotion, a negative emotion, or no emotion toward the brand or product, or whether the annotator could not tell.
* All the columns are of the object data type.
* There are 5802 missing entries in the 'emotion_in_tweet_is_directed_at' column and one missing entry in the tweet_text column.
* There are 22 duplicated entries.
* The unique categories in the emotion_in_tweet_is_directed_at and is_there_an_emotion_directed_at_a_brand_or_product columns were also identified.

### 3. DATA PREPARATION
The data understanding section above checked for null values and duplicates to gain surface-level insights. This section prepares the data by applying the transformations needed to put it in a format suitable for modelling.
But first we need to do a bit of data cleaning.

####  3.1  Data Cleaning
1. Deal with the missing values in the tweet_text and emotion_in_tweet_is_directed_at columns
2. Deal with duplicates
3. Normalize the text case
4. Further cleaning and transformation: removing specific words and numbers from the text

#####  3.1.1  Duplicates
* Drop the duplicated rows.

Rationale: there are 22 duplicated rows in total. We remove them to maintain the integrity of the dataset and ensure that only unique observations are considered.

In [1582]:
df = df.drop_duplicates(keep='first')
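
As a small illustration of the step above, the sketch below counts duplicated rows before dropping them so that the effect of `drop_duplicates` can be verified. The three-row frame is invented for the example.

```python
import pandas as pd

# Illustrative rows only; the third row is an exact copy of the second.
demo = pd.DataFrame({
    "tweet_text": ["ipad everywhere", "love google maps", "love google maps"],
    "label": ["Positive emotion", "Positive emotion", "Positive emotion"],
})

# duplicated() flags every row that is identical to an earlier row.
n_duplicates = int(demo.duplicated().sum())
deduplicated = demo.drop_duplicates(keep="first")

print(n_duplicates, len(demo), len(deduplicated))  # 1 3 2
```

The row count after dropping should equal the original count minus the number of flagged duplicates.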

#####  3.1.2  Missing Values
* Drop the row with a missing value in the tweet_text column, using the pandas dropna() method.
* The 'emotion_in_tweet_is_directed_at' column requires a careful check of its contribution to the final model. More than 50% of its values are missing. Rather than choosing outright between dropping and retaining the column, try to recover sensible values for the missing entries.
* Loop through the texts in the tweet_text column and look for a probable entry, then write these entries to a new column. To do this, first clean the tweet_text column and create a column for the cleaned text, then extract the possible entries.

In [1583]:
# check the missing values in the tweet_text column
df[df['tweet_text'].isna()]

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
6,,,No emotion toward brand or product


In [1584]:
# Drop the missing value in tweet_text column. Only one was identified above
df = df.dropna(subset = ['tweet_text'])

In [1585]:
# Check if missing value is removed in tweet , also check remaining missing values in other columns
Pf.check_for_missing_values(df)

tweet_text                                               0
emotion_in_tweet_is_directed_at                       5788
is_there_an_emotion_directed_at_a_brand_or_product       0
dtype: int64


In [1586]:
# check for the percentage distribution of missing values
missing_percentage = df['emotion_in_tweet_is_directed_at'].isna().value_counts(normalize = True)* 100
# Format the percentages to include the '%' symbol
missing_percentage = missing_percentage.apply(lambda x: f"{x:.2f}%")
missing_percentage


emotion_in_tweet_is_directed_at
True     63.81%
False    36.19%
Name: proportion, dtype: object

#####  3.1.2.1  Dealing with the text column
1. Basic cleaning: converting to lower case and removing special characters such as ?, ;, and .
2. Tokenizing the text column
3. Creating a new column with the joined words
4. Removing the stopwords

* First extract the words in the tweet_text column that start with '@' and those that start with '#'. Words starting with @ are mentions of other Twitter users, while words starting with # are hashtags that tag the tweet with a topic or event. It is worth noting that these words add noise to our dataset.
* Extract the mentioned usernames and the hashtags into separate columns.

In [1587]:

# Regular expressions to extract Twitter mentions (@user) and hashtags (#topic)
pattern = r"@\w+"
pattern_2 = r'#\w+'

# Extract mentions and hashtags from the 'tweet_text' column
df['Usernames'] = df['tweet_text'].apply(lambda x: re.findall(pattern, x))
df['Tagged_Names'] = df['tweet_text'].apply(lambda x: re.findall(pattern_2, x))

df.head()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product,Usernames,Tagged_Names
0,".@wesley83 I have a 3G iPhone. After 3 hrs tweeting at #RISE_Austin, it was dead! I need to upgrade. Plugin stations at #SXSW.",iPhone,Negative emotion,[@wesley83],"[#RISE_Austin, #SXSW]"
1,"@jessedee Know about @fludapp ? Awesome iPad/iPhone app that you'll likely appreciate for its design. Also, they're giving free Ts at #SXSW",iPad or iPhone App,Positive emotion,"[@jessedee, @fludapp]",[#SXSW]
2,@swonderlin Can not wait for #iPad 2 also. They should sale them down at #SXSW.,iPad,Positive emotion,[@swonderlin],"[#iPad, #SXSW]"
3,@sxsw I hope this year's festival isn't as crashy as this year's iPhone app. #sxsw,iPad or iPhone App,Negative emotion,[@sxsw],[#sxsw]
4,"@sxtxstate great stuff on Fri #SXSW: Marissa Mayer (Google), Tim O'Reilly (tech books/conferences) &amp; Matt Mullenweg (Wordpress)",Google,Positive emotion,[@sxtxstate],[#SXSW]
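
To see what the two patterns capture, the sketch below applies them to a single tweet taken from the first row of the dataset. The constant names are chosen for the example only.

```python
import re

MENTION_PATTERN = r"@\w+"   # "@" followed by letters, digits, or underscores
HASHTAG_PATTERN = r"#\w+"   # "#" followed by letters, digits, or underscores

tweet = (
    ".@wesley83 I have a 3G iPhone. After 3 hrs tweeting at #RISE_Austin, "
    "it was dead! I need to upgrade. Plugin stations at #SXSW."
)

mentions = re.findall(MENTION_PATTERN, tweet)
hashtags = re.findall(HASHTAG_PATTERN, tweet)

print(mentions)  # ['@wesley83']
print(hashtags)  # ['#RISE_Austin', '#SXSW']
```

Because `\w` does not match punctuation, the trailing comma after `#RISE_Austin` and the full stop after `#SXSW` are left out of the matches.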


#####  3.1.2.2  Converting the tweet_text to lower case
* Convert the tweets in the tweet_text column to lower case.
* Remove the usernames and hashtags from the text and store the result in a new column, 'clean_tweet_text'.

In [1588]:
# Transform the whole dataset (df[tweet_text]) to lowercase
df["tweet_text"] = df["tweet_text"].str.lower()
# Display full text: uncomment the code below to display the whole texts
#df.style.set_properties(**{'text-align': 'left'})

In [1589]:
# Create a function that removes tokens containing "@" and words starting with "#"
def remove_words_with_at(text):
    # Use a regular expression to remove tokens containing "@" and words starting with "#"
    cleaned_text = re.sub(r'\S*@\S*|#\w+', '', text)
    return cleaned_text

# Create a new column with the cleaned text
df["clean_tweet_text"] = df["tweet_text"].apply(remove_words_with_at)
# Display full text: uncomment the code below to display the whole texts
#df.style.set_properties(**{'text-align': 'left'})
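
One side effect of removing hashtags entirely is that product names written as hashtags (for example `#ipad`) disappear from the cleaned text. The sketch below shows one possible alternative, not the approach used in this notebook: it removes mentions, keeps the word behind each `#`, and collapses the leftover whitespace. The function name is hypothetical.

```python
import re

def strip_mentions_keep_hashtag_words(text):
    # Remove any token containing "@" (e.g. "@swonderlin" or ".@wesley83").
    text = re.sub(r"\S*@\S*", " ", text)
    # Drop only the "#" symbol so that "#ipad" becomes "ipad".
    text = re.sub(r"#(\w+)", r"\1", text)
    # Collapse runs of whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()

sample = "@swonderlin can not wait for #ipad 2 also. they should sale them down at #sxsw."
cleaned = strip_mentions_keep_hashtag_words(sample)
print(cleaned)
# can not wait for ipad 2 also. they should sale them down at sxsw.
```

Whether to keep hashtag words is a judgment call: they preserve product mentions, but event tags such as `sxsw` then appear in almost every tweet and may need to be treated as stopwords.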

* Now loop through the clean_tweet_text column and check for a probable entry. Write these entries to a new column, Tweet_Directed_at.

In [1590]:
categories = ['iPhone', 'iPad or iPhone App', 'iPad', 'Google','Android',
       'Apple', 'Android App', 'Other Google product or service',
       'Other Apple product or service']

In [1591]:
def extract_category_words(tweet, categories):
    # Collect every category name that appears in the tweet (case-insensitive substring match)
    extracted_words = []
    for category in categories:
        if category.lower() in tweet.lower():
            extracted_words.append(category)
    return " ".join(extracted_words)
df['Tweet_Directed_at'] = df['clean_tweet_text'].apply(lambda x: extract_category_words(x, categories))
# Check for the value counts in the new category column
df['Tweet_Directed_at'].value_counts()

Tweet_Directed_at
Google                        2156
                              1781
iPad                          1713
Apple                         1191
iPhone                        1040
iPad Apple                     568
Android                        240
iPhone Android                 118
iPhone iPad                    103
Android Android App             30
iPad Android                    23
iPhone Apple                    23
Google Apple                    23
Google Android                  15
iPad Google                     10
iPhone Android Android App      10
iPhone iPad Android              8
Android Apple                    7
iPhone iPad Apple                4
iPhone Google                    3
iPhone iPad Google               2
iPhone Google Android            2
Name: count, dtype: int64
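
The check above is a plain substring match, so a term such as "Apple" would also match inside an unrelated word like "pineapple". The sketch below contrasts that behaviour with a word-boundary match. Both function names and the sample sentence are hypothetical, and the shortened term list is an assumption made for the example.

```python
import re

def substring_match(tweet, terms):
    # Same idea as the notebook's check: case-insensitive substring search.
    return " ".join(t for t in terms if t.lower() in tweet.lower())

def word_boundary_match(tweet, terms):
    # \b anchors the term at word edges, so "apple" no longer matches "pineapple".
    found = []
    for term in terms:
        if re.search(r"\b" + re.escape(term) + r"\b", tweet, flags=re.IGNORECASE):
            found.append(term)
    return " ".join(found)

terms = ["iPhone", "iPad", "Google", "Android", "Apple"]
tweet = "pineapple juice at the google party"

print(substring_match(tweet, terms))      # Google Apple
print(word_boundary_match(tweet, terms))  # Google
```

The trade-off is that a strict word boundary also misses variants such as "iphones" or "ipad2", so the stricter match is not automatically the better choice for this dataset.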

In [1592]:
# Rearranging the dataframe.
df= df[['tweet_text', 'clean_tweet_text','emotion_in_tweet_is_directed_at','Tweet_Directed_at',
       'is_there_an_emotion_directed_at_a_brand_or_product']]
df.head(2)

Unnamed: 0,tweet_text,clean_tweet_text,emotion_in_tweet_is_directed_at,Tweet_Directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,".@wesley83 i have a 3g iphone. after 3 hrs tweeting at #rise_austin, it was dead! i need to upgrade. plugin stations at #sxsw.","i have a 3g iphone. after 3 hrs tweeting at , it was dead! i need to upgrade. plugin stations at .",iPhone,iPhone,Negative emotion
1,"@jessedee know about @fludapp ? awesome ipad/iphone app that you'll likely appreciate for its design. also, they're giving free ts at #sxsw","know about ? awesome ipad/iphone app that you'll likely appreciate for its design. also, they're giving free ts at",iPad or iPhone App,iPhone iPad,Positive emotion


* From the value counts above, the Tweet_Directed_at column has 1781 empty entries.
* Identify the rows where Tweet_Directed_at is empty but the 'emotion_in_tweet_is_directed_at' column has a value, and fill the empty entries with that value.
* Check the value counts of the Tweet_Directed_at column again to verify that the number of empty entries has decreased.
* Proceed to drop the rows whose entries are still empty.

In [1593]:
# Check for rows where emotion_in_tweet_is_directed_at has data, but Tweet_Directed_at is blank
condition_a_data_b_blank = (df['emotion_in_tweet_is_directed_at'].notna() & (df['emotion_in_tweet_is_directed_at'] != "") & 
                            (df['Tweet_Directed_at'].isna() | (df['Tweet_Directed_at'] == "")))


condition_a_data_b_blank.value_counts()

False    8717
True      353
Name: count, dtype: int64

In [1594]:
# Update the 'Tweet_Directed_at' column with the values from 'emotion_in_tweet_is_directed_at'
df.loc[condition_a_data_b_blank, 'Tweet_Directed_at'] = df.loc[condition_a_data_b_blank, 'emotion_in_tweet_is_directed_at']
# Show the updated value counts
df['Tweet_Directed_at'].value_counts()

Tweet_Directed_at
Google                             2196
iPad                               1801
                                   1428
Apple                              1263
iPhone                             1058
iPad Apple                          568
Android                             259
iPhone Android                      118
iPhone iPad                         103
iPad or iPhone App                   74
Android Android App                  30
iPad Android                         23
iPhone Apple                         23
Google Apple                         23
Android App                          16
Google Android                       15
Other Google product or service      13
Other Apple product or service       13
iPad Google                          10
iPhone Android Android App           10
iPhone iPad Android                   8
Android Apple                         7
iPhone iPad Apple                     4
iPhone Google                         3
iPhone iPad Google    

* Create a mapping that assigns each entry to Google products, Apple products, Unknown, or IRR (irrelevant, used when a tweet mentions both companies). Apply this dictionary to create a column showing which company the tweet was directed at.

In [1595]:
# Define mapping for fewer categories
category_mapping = {
    'Google': 'Google Products',
    'iPad': 'Apple Products',
    '': 'Unknown',
    'Apple': 'Apple Products',
    'iPhone': 'Apple Products',
    'iPad Apple': 'Apple Products',
    'Android': 'Google Products',
    'iPhone Android': 'IRR',
    'iPhone iPad': 'Apple Products',
    'iPad or iPhone App': 'Apple Products',
    'Android Android App': 'Google Products',
    'iPad Android': 'IRR',
    'iPhone Apple': 'Apple Products',
    'Google Apple': 'IRR',
    'Android App': 'Google Products',
    'Google Android': 'Google Products',
    'Other Google product or service': 'Google Products',
    'Other Apple product or service': 'Apple Products',
    'iPad Google': 'IRR',
    'iPhone Android Android App': 'IRR',
    'iPhone iPad Android': 'IRR',
    'Android Apple': 'IRR',
    'iPhone iPad Apple': 'Apple Products',
    'iPhone Google': 'IRR',
    'iPhone iPad Google': 'IRR',
    'iPhone Google Android': 'IRR'
}


# Apply mapping to the dataframe
df['Company_Product'] = df['Tweet_Directed_at'].map(category_mapping)
# Display the proportion of tweets in each category
df['Company_Product'].value_counts(normalize= True)


Company_Product
Apple Products     0.541014
Google Products    0.278831
Unknown            0.157442
IRR                0.022712
Name: proportion, dtype: float64
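
The hand-written dictionary above follows a simple rule: entries that mention only Apple terms map to Apple, entries that mention only Google terms map to Google, mixed entries map to IRR, and empty entries map to Unknown. The sketch below expresses that rule as a function, which would also cover combinations not listed in the dictionary. It is an alternative formulation, not the method used in this notebook, and the names `APPLE_TERMS`, `GOOGLE_TERMS`, and `company_for` are hypothetical.

```python
APPLE_TERMS = {"iPhone", "iPad", "Apple"}
GOOGLE_TERMS = {"Google", "Android"}

def company_for(directed_at):
    """Map a space-joined Tweet_Directed_at value to a company label."""
    tokens = set(directed_at.split())
    has_apple = bool(tokens & APPLE_TERMS)
    has_google = bool(tokens & GOOGLE_TERMS)
    if has_apple and has_google:
        return "IRR"          # mentions both companies
    if has_apple:
        return "Apple Products"
    if has_google:
        return "Google Products"
    return "Unknown"          # empty or unrecognised entry

for value in ["iPad or iPhone App", "Android Android App", "iPhone Google", ""]:
    print(repr(value), "->", company_for(value))
```

With a function like this, the new column could be built with `df['Tweet_Directed_at'].apply(company_for)` instead of `.map(category_mapping)`.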

In [1596]:
# checking for any null values in the Company_product column
df['Company_Product'].isna().sum()

0

* We have now created a column that maps each tweet to a company's products. Missing values are denoted by the word "Unknown".
* Proceed to drop the "Unknown" rows and the "IRR" rows, which represent irrelevant categories.
* Then drop the 'emotion_in_tweet_is_directed_at' column, because 'Company_Product' is a better representation of the same information. The missing values have now been dealt with. Proceed to the analysis of the clean_tweet_text column.

In [1597]:
# Drop rows labelled 'Unknown' or 'IRR'
df = df[(df['Company_Product'] != 'Unknown') & (df['Company_Product'] != 'IRR')]
df = df.reset_index(drop=True)
# Confirm the remaining missing values
Pf.check_for_missing_values(df)

tweet_text                                               0
clean_tweet_text                                         0
emotion_in_tweet_is_directed_at                       4216
Tweet_Directed_at                                        0
is_there_an_emotion_directed_at_a_brand_or_product       0
Company_Product                                          0
dtype: int64


In [1598]:
# drop the emotion_in_tweet_is_directed_at column 
df.drop('emotion_in_tweet_is_directed_at', axis=1, inplace=True)
df

Unnamed: 0,tweet_text,clean_tweet_text,Tweet_Directed_at,is_there_an_emotion_directed_at_a_brand_or_product,Company_Product
0,".@wesley83 i have a 3g iphone. after 3 hrs tweeting at #rise_austin, it was dead! i need to upgrade. plugin stations at #sxsw.","i have a 3g iphone. after 3 hrs tweeting at , it was dead! i need to upgrade. plugin stations at .",iPhone,Negative emotion,Apple Products
1,"@jessedee know about @fludapp ? awesome ipad/iphone app that you'll likely appreciate for its design. also, they're giving free ts at #sxsw","know about ? awesome ipad/iphone app that you'll likely appreciate for its design. also, they're giving free ts at",iPhone iPad,Positive emotion,Apple Products
2,@swonderlin can not wait for #ipad 2 also. they should sale them down at #sxsw.,can not wait for 2 also. they should sale them down at .,iPad,Positive emotion,Apple Products
3,@sxsw i hope this year's festival isn't as crashy as this year's iphone app. #sxsw,i hope this year's festival isn't as crashy as this year's iphone app.,iPhone,Negative emotion,Apple Products
4,"@sxtxstate great stuff on fri #sxsw: marissa mayer (google), tim o'reilly (tech books/conferences) &amp; matt mullenweg (wordpress)","great stuff on fri : marissa mayer (google), tim o'reilly (tech books/conferences) &amp; matt mullenweg (wordpress)",Google,Positive emotion,Google Products
...,...,...,...,...,...
7431,"@mention yup, but i don't have a third app yet. i'm on android, any suggestions? #sxsw cc: @mention","yup, but i don't have a third app yet. i'm on android, any suggestions? cc:",Android,No emotion toward brand or product,Google Products
7432,ipad everywhere. #sxsw {link},ipad everywhere. {link},iPad,Positive emotion,Apple Products
7433,"google's zeiger, a physician never reported potential ae. yet fda relies on physicians. &quot;we're operating w/out data.&quot; #sxsw #health2dev","google's zeiger, a physician never reported potential ae. yet fda relies on physicians. &quot;we're operating w/out data.&quot;",Google,No emotion toward brand or product,Google Products
7434,some verizon iphone customers complained their time fell back an hour this weekend. of course they were the new yorkers who attended #sxsw.,some verizon iphone customers complained their time fell back an hour this weekend. of course they were the new yorkers who attended .,iPhone,No emotion toward brand or product,Apple Products


In [1599]:
# Check the "is_there_an_emotion_directed_at_a_brand_or_product" column
# Recall that this column has four unique values; look at their counts
df['is_there_an_emotion_directed_at_a_brand_or_product'].value_counts()

is_there_an_emotion_directed_at_a_brand_or_product
No emotion toward brand or product    3887
Positive emotion                      2856
Negative emotion                       555
I can't tell                           138
Name: count, dtype: int64

In [1600]:
# Check for NaNs
df['is_there_an_emotion_directed_at_a_brand_or_product'].isna().sum()

0

In [1601]:
# We will now streamline the categories into shorter forms ("Positive", "Negative", "Neutral"
# and "Unknown") using mapping
Updated_emotion_categories = {'Negative emotion': 'Negative', 'Positive emotion': 'Positive', 
                'No emotion toward brand or product': 'Neutral', 
                "I can't tell": 'Unknown'}
df['is_there_an_emotion_directed_at_a_brand_or_product'] = df['is_there_an_emotion_directed_at_a_brand_or_product'].map(Updated_emotion_categories)
df.head()

Unnamed: 0,tweet_text,clean_tweet_text,Tweet_Directed_at,is_there_an_emotion_directed_at_a_brand_or_product,Company_Product
0,".@wesley83 i have a 3g iphone. after 3 hrs tweeting at #rise_austin, it was dead! i need to upgrade. plugin stations at #sxsw.","i have a 3g iphone. after 3 hrs tweeting at , it was dead! i need to upgrade. plugin stations at .",iPhone,Negative,Apple Products
1,"@jessedee know about @fludapp ? awesome ipad/iphone app that you'll likely appreciate for its design. also, they're giving free ts at #sxsw","know about ? awesome ipad/iphone app that you'll likely appreciate for its design. also, they're giving free ts at",iPhone iPad,Positive,Apple Products
2,@swonderlin can not wait for #ipad 2 also. they should sale them down at #sxsw.,can not wait for 2 also. they should sale them down at .,iPad,Positive,Apple Products
3,@sxsw i hope this year's festival isn't as crashy as this year's iphone app. #sxsw,i hope this year's festival isn't as crashy as this year's iphone app.,iPhone,Negative,Apple Products
4,"@sxtxstate great stuff on fri #sxsw: marissa mayer (google), tim o'reilly (tech books/conferences) &amp; matt mullenweg (wordpress)","great stuff on fri : marissa mayer (google), tim o'reilly (tech books/conferences) &amp; matt mullenweg (wordpress)",Google,Positive,Google Products


In [1602]:
# Get the value counts for the emotion-directed column
Count_emotion = df['is_there_an_emotion_directed_at_a_brand_or_product'].value_counts()

# Calculate the percentage for each category
percentage_emotion = df['is_there_an_emotion_directed_at_a_brand_or_product'].value_counts(normalize=True) * 100

# Combine both count and percentage into a DataFrame
result_emotion = pd.DataFrame({
    'Count': Count_emotion,
    'Percentage': percentage_emotion.apply(lambda x: f"{x:.2f}%")  # Format percentage to 2 decimal places
})

result_emotion


is_there_an_emotion_directed_at_a_brand_or_product,Count,Percentage
Neutral,3887,52.27%
Positive,2856,38.41%
Negative,555,7.46%
Unknown,138,1.86%


In [1603]:
# Inspect the tweets in the "Unknown" emotion category
pd.set_option("display.max_colwidth", 300)
Emotion_Unknown = df[df['is_there_an_emotion_directed_at_a_brand_or_product']=='Unknown']
Emotion_Unknown


Unnamed: 0,tweet_text,clean_tweet_text,Tweet_Directed_at,is_there_an_emotion_directed_at_a_brand_or_product,Company_Product
76,ûï@mention &quot;apple has opened a pop-up store in austin so the nerds in town for #sxsw can get their new ipads. {link} #wow,&quot;apple has opened a pop-up store in austin so the nerds in town for can get their new ipads. {link},iPad Apple,Unknown,Apple Products
189,"just what america needs. rt @mention google to launch major new social network called circles, possibly today {link} #sxsw","just what america needs. rt google to launch major new social network called circles, possibly today {link}",Google,Unknown,Google Products
286,the queue at the apple store in austin is four blocks long. crazy stuff! #sxsw,the queue at the apple store in austin is four blocks long. crazy stuff!,Apple,Unknown,Apple Products
308,hope it's better than wave rt @mention buzz is: google's previewing a social networking platform at #sxsw: {link},hope it's better than wave rt buzz is: google's previewing a social networking platform at : {link},Google,Unknown,Google Products
344,syd #sxsw crew your iphone extra juice pods have been procured.,syd crew your iphone extra juice pods have been procured.,iPhone,Unknown,Apple Products
...,...,...,...,...,...
7374,it's funny watching a room full of people hold their ipad in the air to take a photo. like a room full of tablets staring you down. #sxsw,it's funny watching a room full of people hold their ipad in the air to take a photo. like a room full of tablets staring you down.,iPad,Unknown,Apple Products
7382,"@mention yeah, we have @mention , google has nothing on us :) #sxsw","yeah, we have , google has nothing on us :)",Google,Unknown,Google Products
7387,"@mention yes, the google presentation was not exactly what i was expecting. #sxsw","yes, the google presentation was not exactly what i was expecting.",Google,Unknown,Google Products
7405,&quot;do you know what apple is really good at? making you feel bad about your xmas present!&quot; - seth meyers on ipad2 #sxsw #doyoureallyneedthat?,&quot;do you know what apple is really good at? making you feel bad about your xmas present!&quot; - seth meyers on ipad2 ?,iPad Apple,Unknown,Apple Products


The first five rows show that the emotion behind these tweets is ambiguous: some may be sarcastic and others genuine, and we cannot tell which. The value counts above also show that the "Unknown" category is small (138 tweets, 1.86% of the "emotion" column), so it is more prudent to drop these rows.
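As a quick check on the share quoted above, the percentages can be recomputed from the raw counts. This is a minimal standard-library sketch; the names `emotion_counts`, `shares` and `remaining` are illustrative and do not appear in the notebook.

```python
from collections import Counter

# Counts copied from the value_counts() output above
emotion_counts = Counter({
    "Neutral": 3887,
    "Positive": 2856,
    "Negative": 555,
    "Unknown": 138,
})

total = sum(emotion_counts.values())

# Percentage share of each category, rounded to 2 decimal places
shares = {label: round(100 * n / total, 2) for label, n in emotion_counts.items()}

# Number of rows kept once the "Unknown" rows are dropped
remaining = total - emotion_counts["Unknown"]

print(shares)
print(remaining)
```

Dropping "Unknown" leaves 7,298 of the 7,436 labelled tweets, which agrees with the value counts reported after the filter.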


In [1604]:

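# Drop the "Unknown" rows; .copy() returns an independent DataFrame, so adding columns later does not trigger SettingWithCopyWarning
df = df[df['is_there_an_emotion_directed_at_a_brand_or_product'] != 'Unknown'].copy()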
df['is_there_an_emotion_directed_at_a_brand_or_product'].value_counts()

is_there_an_emotion_directed_at_a_brand_or_product
Neutral     3887
Positive    2856
Negative     555
Name: count, dtype: int64

####  3.2 Additional Data Cleaning & EDA with NLTK

In this section we focus on the Positive and Negative emotions captured in the "is_there_an_emotion_directed_at_a_brand_or_product" column, as our objective is to analyze those two categories.

We will start by removing the following from the clean_tweet_text column:

1. URLs
2. Non-alphanumeric characters
3. Numbers/digits

#####  3.2.1 Removing URLs, special characters and digits
* In this section we inspect the clean_tweet_text column for text that still needs cleaning: special characters, URLs and numbers/digits.
* Create a function that removes all of the above.
* Test the function on a specific tweet, then apply it to the whole dataframe.
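The three removal steps listed above can be sketched as a single function. This is only an illustration under stated assumptions: `strip_noise` and the compiled patterns are hypothetical names, not part of the notebook, and the order (URLs first, then non-alphanumerics, then digits) is one reasonable choice rather than the only one.

```python
import re

# Illustrative patterns; the names are assumptions, not taken from the notebook
URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")
DIGIT_RE = re.compile(r"\d+")
SPACE_RE = re.compile(r"\s+")

def strip_noise(text):
    """Remove URLs, non-alphanumeric characters and digits from a tweet."""
    if not isinstance(text, str):
        return ""
    text = URL_RE.sub(" ", text.lower())   # 1. URLs (before punctuation is touched)
    text = NON_ALNUM_RE.sub(" ", text)     # 2. non-alphanumeric characters
    text = DIGIT_RE.sub(" ", text)         # 3. numbers/digits
    return SPACE_RE.sub(" ", text).strip() # collapse the leftover whitespace

example = "great  ipad app from  http://tinyurl.com/4nqv92l"
print(strip_noise(example))
```

URLs are removed first because stripping punctuation beforehand would break a URL into fragments that no longer match the URL pattern.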

In [1605]:
sentence = df["clean_tweet_text"][13]
sentence

'great  ipad app from  http://tinyurl.com/4nqv92l'

import re

def cleaning_tokens(text):
    """Strip URLs and digit-bearing words from a raw tweet string."""
    if not isinstance(text, str):
        return ''
    # Chain every substitution on `token`; calling re.sub on `text` each time
    # would keep only the result of the last substitution.
    token = re.sub(r'https?://\S+|www\.\S+', '', text)  # URLs starting with http(s):// or www.
    token = re.sub(r'\b\d+[a-zA-Z]+\b', '', token)  # Number followed by letters (e.g., 141st)
    token = re.sub(r'\b[a-zA-Z]+\d+\b', '', token)  # Letters followed by numbers (e.g., abc12345)
    token = re.sub(r'\b[a-zA-Z]+\d+[a-zA-Z]+\b', '', token)  # Letters, numbers, then letters (e.g., abc123def)
    token = re.sub(r'\b\d+(\.\d+)?\b', '', token)  # Any number (integer or decimal)
    return token

cleaned_sentence = cleaning_tokens(sentence)
cleaned_sentence

In [1606]:
sentence = df["clean_tweet_text"][644]
sentence

'google to launch major new social network called circles, possibly today {link}  rt  via '

In [1607]:
# Import the regexptokenizer
# Check how it works on the example sentence above
from nltk.tokenize import RegexpTokenizer

basic_token_pattern = r"(?u)\b\w\w+\b"

tokenizer = RegexpTokenizer(basic_token_pattern)
tokenizer.tokenize(sentence)

['google',
 'to',
 'launch',
 'major',
 'new',
 'social',
 'network',
 'called',
 'circles',
 'possibly',
 'today',
 'link',
 'rt',
 'via']

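* Notice that the RegexpTokenizer strips punctuation and other special characters and splits the text into a list of words. With this pattern it also drops one-character tokens such as "a" or a lone digit. However, it does not remove URLs (they are split into fragments such as 'http' and 'tinyurl'), stopwords/filler words, or longer numbers.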
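To see what the pattern keeps and drops without loading NLTK, the same regular expression can be run through `re.findall`. This sketch assumes that `RegexpTokenizer` in its default (non-gap) mode returns the pattern's matches, so the standard-library call should give the same tokens; `tokenize` is a hypothetical helper name.

```python
import re

# Same pattern the notebook passes to RegexpTokenizer
BASIC_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def tokenize(text):
    """Return every run of two or more word characters."""
    return re.findall(BASIC_TOKEN_PATTERN, text)

# One-character tokens ("i", "a", "3") are dropped; "3g" survives
print(tokenize("i have a 3g iphone. after 3 hrs tweeting"))

# A URL is split into fragments rather than removed
print(tokenize("great  ipad app from  http://tinyurl.com/4nqv92l"))
```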

In [1608]:
# Create new column with tokenized data
df["text_tokenized"] = df["clean_tweet_text"].apply(tokenizer.tokenize)
# Display full text
#df.style.set_properties(**{'text-align': 'left'})
df.head(3)



Unnamed: 0,tweet_text,clean_tweet_text,Tweet_Directed_at,is_there_an_emotion_directed_at_a_brand_or_product,Company_Product,text_tokenized
0,".@wesley83 i have a 3g iphone. after 3 hrs tweeting at #rise_austin, it was dead! i need to upgrade. plugin stations at #sxsw.","i have a 3g iphone. after 3 hrs tweeting at , it was dead! i need to upgrade. plugin stations at .",iPhone,Negative,Apple Products,"[have, 3g, iphone, after, hrs, tweeting, at, it, was, dead, need, to, upgrade, plugin, stations, at]"
1,"@jessedee know about @fludapp ? awesome ipad/iphone app that you'll likely appreciate for its design. also, they're giving free ts at #sxsw","know about ? awesome ipad/iphone app that you'll likely appreciate for its design. also, they're giving free ts at",iPhone iPad,Positive,Apple Products,"[know, about, awesome, ipad, iphone, app, that, you, ll, likely, appreciate, for, its, design, also, they, re, giving, free, ts, at]"
2,@swonderlin can not wait for #ipad 2 also. they should sale them down at #sxsw.,can not wait for 2 also. they should sale them down at .,iPad,Positive,Apple Products,"[can, not, wait, for, also, they, should, sale, them, down, at]"


In [1611]:
# Create Sentence to test how the function is working
sentence = df["text_tokenized"][13]
sentence

['great', 'ipad', 'app', 'from', 'http', 'tinyurl', 'com', '4nqv92l']

In [1612]:
def cleaning_tokens(text):
    # Intended for a single token (a string); the ^...$ anchors match the whole token
    if not isinstance(text, str):
        return ''
    # Chain each substitution on `token` so that earlier substitutions are not discarded
    token = re.sub(r'http[s]?:\/\/\S+|www\.\S+', '', text) # Match URLs starting with http(s):// or www.
    token = re.sub(r'^\d+[a-zA-Z]+$', '', token) # Number followed by letters (e.g., 141st)
    token = re.sub(r'^[a-zA-Z]+\d+\d+$', '', token) # Letters followed by number followed by numbers (e.g., abc12345)
    token = re.sub(r'^\d+(\.\d+)?$', '', token) # Any number (integer or decimal)
    token = re.sub(r'^[a-zA-Z]+\d+[a-zA-Z]+$', '', token)  # Letters followed by number followed by letters (e.g., abc123def)
    return token

In [1613]:
# `sentence` is a list of tokens rather than a string, so the isinstance guard returns ''
cleaned_sentence = cleaning_tokens(sentence)
cleaned_sentence

''

In [1576]:
# Create Sentence to test how the function is working
sentence = df["text_tokenized"][644]
sentence


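# Clean a single token (remove unwanted characters and convert to lowercase)
def clean_text(text):
    if not isinstance(text, str):
        return ''
    text = re.sub(r'@[A-Za-z0-9_]+', '', text)  # Remove mentions (@user)
    text = re.sub(r'http\S+', '', text)  # Remove URLs
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove non-alphabetic characters
    text = text.lower()  # Convert text to lowercase
    return text

# "text_tokenized" holds lists of tokens, so clean each token and drop any that end up empty.
# Applying clean_text directly to a list would return '' and wipe the column.
def clean_tokens(tokens):
    return [cleaned for cleaned in (clean_text(token) for token in tokens) if cleaned]

#df['cleaned_tweet'] = df['tweet_text'].apply(clean_text)
cleaned_sentence = clean_tokens(sentence)
cleaned_sentence

df["text_tokenized"] = df["text_tokenized"].apply(clean_tokens)
df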

In [None]:
#displaying 15 most common tokens
from nltk import FreqDist
tokens_count = [token for tokens in df["text_tokenized"] for token in tokens]
freq = FreqDist(tokens_count)
freq.most_common(15)

Before we proceed with further cleaning, the results above show that stopwords dominate the text and therefore need to be removed. Tokens such as "link" also need to be investigated to determine whether they refer to a website link or something else. We will check the token "rt" as well.

In [None]:
text = df["text_tokenized"][1]
text

In [None]:
# Get the list of English stop words
from nltk.corpus import stopwords

stop_words = stopwords.words('english')
# Function to remove stop words
def remove_stop_words(tokens_list):
    """
    Removes stop words from a list of tokens.
    Arguments:
    - tokens_list: List of tokenized words.
    
    Returns:
    - List of tokens without stop words.
    """
    return [word for word in tokens_list if word.lower() not in stop_words]

# Apply the function to the tokenized text column
df["text_tokenized"] = df["text_tokenized"].apply(remove_stop_words)
df.head()

In [None]:
#displaying 10 most common tokens
from nltk import FreqDist
tokens_count = [token for tokens in df["text_tokenized"] for token in tokens]
freq = FreqDist(tokens_count)
freq.most_common(10)

In [None]:
# Search for rows where 'link' is in the tokenized text
matching_rows = df[df['text_tokenized'].apply(lambda tokens: 'link' in tokens)]

# Display the first 5 tweet texts where 'link' is found
print(matching_rows['tweet_text'].head(5))

In [None]:
# Search for rows where 'rt' is in the tokenized text
matching_rows = df[df['text_tokenized'].apply(lambda tokens: 'rt' in tokens)]

# Display the first 5 tweet texts where 'rt' is found
print(matching_rows['tweet_text'].head(5))

From the above we can see that "link" is a placeholder for a URL and "rt" marks a retweet. Neither is needed for our analysis, so we will add both to our stopwords list and remove them.
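The idea of extending a stop list with corpus-specific tokens can be sketched in isolation. The small `base_stop_words` set below is a made-up stand-in for NLTK's English list, and `drop_stop_words` is a hypothetical helper; a set is used here because membership tests on a set are faster than on a list.

```python
# Illustrative stand-in for NLTK's English stop list (an assumption, not the real list)
base_stop_words = {"to", "the", "at", "for", "via"}

# Corpus-specific tokens identified above: URL placeholder and retweet marker
extra_stop_words = {"link", "rt"}

stop_word_set = base_stop_words | extra_stop_words

def drop_stop_words(tokens, stop_set=stop_word_set):
    """Keep only the tokens that are not in the stop set (case-insensitive)."""
    return [token for token in tokens if token.lower() not in stop_set]

tokens = ["google", "to", "launch", "major", "new", "social", "network",
          "called", "circles", "possibly", "today", "link", "rt", "via"]
print(drop_stop_words(tokens))
```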

In [1432]:
stop_words += ['link', 'rt']




In [None]:
df["text_tokenized"] = df["text_tokenized"].apply(remove_stop_words)

In [None]:
# Search for rows where 'link' is in the tokenized text
matching_rows = df[df['text_tokenized'].apply(lambda tokens: 'link' in tokens)]

# Display the first 5 tweet texts where 'link' is found
print(matching_rows['tweet_text'].head(5))

In [None]:
# Search for rows where 'rt' is in the tokenized text
matching_rows = df[df['text_tokenized'].apply(lambda tokens: 'rt' in tokens)]

# Display the first 5 tweet texts where 'rt' is found
print(matching_rows['tweet_text'].head(5))

In [None]:
#displaying 15 most common tokens
from nltk import FreqDist
tokens_count = [token for tokens in df["text_tokenized"] for token in tokens]
freq = FreqDist(tokens_count)
freq.most_common(15)

After removing 'link' and 'rt', the list of most common words above is more refined.
From these results we can already see ipad and google being the most popular.

Having removed 'link' and 'rt', we will proceed to remove one-letter and two-letter words, except the word 'no', which we have determined can be used to convey emotion.
This way we will minimize unnecessary words.

In [None]:
# Function to remove one- and two-letter words like 'i' and 'at', keeping exceptions like 'no'
def remove_two_letter_words(tokens):
    return [word for word in tokens if len(word) > 2 or word.lower() == 'no']

# Apply the function to the 'text_tokenized' column
df['text_tokenized'] = df['text_tokenized'].apply(remove_two_letter_words)
df.head()

We now proceed to lemmatization. Lemmatization is preferred to stemming because it reduces a word to its base or dictionary form (lemma) based on its linguistic meaning and context, whereas stemming reduces a word to a root form by stripping affixes without considering linguistic context, which can produce non-words.
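The difference between the two approaches can be illustrated with a deliberately simplified toy. Neither function below is NLTK's algorithm: `toy_stem` strips a fixed list of suffixes and `toy_lemmatize` looks words up in a tiny hand-written table, both of which are assumptions made purely for illustration.

```python
# Toy suffix list and lemma table: illustrative assumptions, not NLTK data
SUFFIXES = ("ies", "ing", "ed", "s")
TOY_LEMMAS = {"studies": "study", "ponies": "pony", "launches": "launch"}

def toy_stem(word):
    """Strip the first matching suffix, ignoring what the word means."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in a dictionary of known base forms."""
    return TOY_LEMMAS.get(word, word)

for word in ["studies", "ponies", "launches"]:
    print(word, toy_stem(word), toy_lemmatize(word))
```

The suffix stripper returns fragments such as "stud" and "launche", while the dictionary lookup returns real words, which is the property that motivates choosing lemmatization here.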


In [None]:
# Initialize the lemmatizer
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Function to lemmatize tokens
# (lemmatize() treats each token as a noun unless a pos argument is passed)
def lemmatize_tokens(tokens):
    return [lemmatizer.lemmatize(token) for token in tokens]

# Apply lemmatization to the 'text_tokenized' column
df['text_tokenized_lemmatized'] = df['text_tokenized'].apply(lemmatize_tokens)


In [None]:
df.head(10)

In [1441]:
df.to_excel('cleaned_dataframe.xlsx', index=False)

### EXPLORATORY DATA ANALYSIS 
##### 1. Looking at the most popular words
##### 2. Looking at popular words tied to positive emotion
##### 3. Looking at popular words tied to negative emotion
##### 4. Distribution of positive, negative and neutral emotion for both Apple and Google
##### 5. Looking at distribution of words tied to Apple and Google
##### 6. Looking at how Apple and Google are tied to a positive emotion 
##### 7. Looking at how Apple and Google are tied to a negative emotion
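Items 2 and 3 in the plan above amount to counting tokens separately for each emotion label. A minimal standard-library sketch of that grouping follows; the `rows` sample and the `top_words_by_label` helper are hypothetical and only mimic the shape of the label and `text_tokenized` columns.

```python
from collections import Counter, defaultdict

# Hypothetical (label, tokens) pairs shaped like the emotion and text_tokenized columns
rows = [
    ("Positive", ["awesome", "ipad", "app"]),
    ("Negative", ["iphone", "dead", "upgrade"]),
    ("Positive", ["great", "ipad", "google"]),
    ("Negative", ["crashy", "iphone", "app"]),
]

def top_words_by_label(rows, n=2):
    """Return the n most common tokens for each emotion label."""
    counters = defaultdict(Counter)
    for label, tokens in rows:
        counters[label].update(tokens)
    return {label: counter.most_common(n) for label, counter in counters.items()}

top = top_words_by_label(rows)
print(top)
```

The same pattern extends to items 6 and 7 by grouping on a (company, emotion) pair instead of the emotion label alone.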

In [None]:
# Looking at the 10 most popular words

from nltk import FreqDist
tokens_count = [token for tokens in df["text_tokenized"] for token in tokens]
freq = FreqDist(tokens_count)
freq.most_common(10)


In [None]:
# Function to remove numerical values using regex
def remove_numbers(text):
    return re.sub(r'\d+', '', text)

# Check that the function works on a raw tweet string
# (re.sub expects a string, so a list of tokens cannot be passed in directly)
clean_sentence = remove_numbers(df["clean_tweet_text"][13])
clean_sentence