## BUSINESS UNDERSTANDING

Overview
Kenya Power and Lighting Company (KPLC) often receives a high volume of tweets from customers reporting issues, asking questions, or providing feedback.Understanding customer sentiment towards KPLC is crucial to enable automating of responses, enhancing customer service efficiency, improving response times, and reduce the manual workload on customer service teams. The goal is to develop a chatbot capable of classifying various types of tweets and generating appropriate automated responses.


## Problem Statement
KPLC needs an automated sentiment analysis system to process and categorize customer feedback from social media, particularly X formerly (Twitter) where customers frequently express their sentiments regarding KPLC's services. By accurately classifying tweets related to KPLC’s services into sentiment categories the system will be able to identify issues by pinpointing common complaints and service issues and enhance customer feedback

### Objectives

* To gauge overall customer sentiment towards KPLC's services.·   

* To Identify specific issues mentioned in the tweets, such as token problems, power outages, billing issues, etc.

* To Create a chatbot that provides appropriate responses to customer inquiries


### Challenges
1. Data Collection and Preprocessing:
Gathering relevant tweets mentioning KPLC, especially when customers use various hashtags, misspellings or slang, can be difficult. Additionally, cleaning and preprocessing the data (e.g., removing noise like unrelated tweets, abbreviations) is crucial but time-consuming.

2. Sentiment Analysis Accuracy:
Accurately classifying the sentiment of tweets can be challenging due to the informal language, sarcasm, mixed sentiments and local dialects often used on X/Twitter.

3. Identifying Specific Issues:
Extracting and categorizing specific issues (e.g power outages, billing issues) mentioned in tweets can be complex due to the diverse ways in which customers describe their problems.

4. Real-time Data Processing:
Processing a continuous stream of tweets in real-time to provide timely insights and responses is demanding in terms of computational resources and model efficiency.

5. Handling Multilingual and Local Dialects:
Tweets may be in multiple languages or include local dialects, which can complicate sentiment analysis and issue detection. 
6. Evaluating Model Performance:
Ensuring the models perform well across different contexts, languages, and over time requires ongoing evaluation and tuning.




### Proposed Solution

* Use advanced Natural Language Processing (NLP) techniques and APIs (e.g., Twitter API) to collect and preprocess tweets.

* Implement data cleaning scripts to filter out irrelevant data and normalize the text for consistent analysis. 

* Train sentiment analysis models using machine learning techniques such as supervised learning with labeled datasets

* Implement a robust pipeline using tools for real-time data streaming and processing. Integrate with scalable cloud services such as AWS or Google Cloud to ensure the system can handle large volumes of data efficiently.

* Utilize existing chatbot frameworks like Rasa, integrated with the sentiment analysis and issue categorization models. This chatbot should be able to provide relevant responses based on the sentiment and identified issues and direct users to appropriate resources or support channels.

* Incorporate multilingual NLP models and fine-tune them with local dialect data. Using translation APIs where necessary to standardize inputs before analysis.

* Set up a continuous evaluation framework using A/B testing, cross-validation and performance metrics such as accuracy, F1-score and precision/recall. Regularly retrain models with new data to adapt to evolving customer language and sentiment.



### Metrics of success:

* Sentiment Accuracy: Percentage of correctly classified sentiments (positive, negative, neutral).

* Issue Detection Rate: Number of key issues identified and addressed based on sentiment analysis.


### Conclusion
The analysis of the tweets reveals that for Kenya Power and Lightning Company(KPLC),sentiment analysis of the tweets can o along way in assisting the company to understand and deal with customer feedback.In this way,KPLC will be able to focus on identifying the main problems developing and implementing corresponding strategies for the company’s service improvement and ultimately increasing the customer satisfaction level of their customers .The company will be able to maintain their brand image and identify the impending issues before they happen.

Despite the difficulties like dealing with  vast data and identification while analyzing the social media concerns ,performing sentiment analysis by analyzing tweets is effective.Since KLC has established key performance indicators of some of its goals such as raise in customer satisfaction scores and positive trend on brand sentiment,the company can use this tool to sustain its leadership in the energy sector while at the same time strengthening its relations with customers.


## DATA CLEANING

In [6]:
# Importing all the necessary Modules
import os
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import glob
import pandas as pd
import re
import emoji
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from langdetect import detect
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

Merging all CSV files into one CSV file.

In [7]:
# Specify the path of CSV files
path = r'C:\Users\USER\Desktop\Data\PHASE5\KPLC'  # Replace with your actual path
all_files = glob.glob(os.path.join(path, "*.csv"))

# Read and combine all CSV files
df_list = [pd.read_csv(file) for file in all_files]
combined_df = pd.concat(df_list, ignore_index=True)

# Save the combined DataFrame to a new CSV file called kplc_df.csv
combined_df.to_csv('kplc_df.csv', index=False)


Now that we have merged all our csvs into one csv file called kplc_df, let us not look at the basic info of our data 

In [8]:
class DataInfo:
    def __init__(self, file_path):
        # Initialize by reading the CSV file into a DataFrame
        self.df = pd.read_csv(file_path)
    
    def get_shape(self):
        # Return the shape of the DataFrame
        shape = self.df.shape
        print(f"Shape of the DataFrame: {shape}")
        return shape
    
    def get_dtypes(self):
        # Return the data types of each column
        dtypes = self.df.dtypes
        print("Data types of each column:")
        print(dtypes)
        return dtypes
    
    def get_missing_values(self):
        # Return the number of missing values per column
        missing_values = self.df.isnull().sum()
        print("Missing values per feature:")
        print(missing_values)
        return missing_values
    
    def get_basic_info(self):
        # Print basic info including shape, data types, and missing values
        print("Basic Information:")
        self.get_shape()
        self.get_dtypes()
        self.get_missing_values()
    
    def remove_irrelevant_columns(self, columns_to_remove):
        # Remove only the columns that exist in the DataFrame
        existing_columns = [col for col in columns_to_remove if col in self.df.columns]
        self.df.drop(columns=existing_columns, inplace=True)
        print(f"Removed columns: {existing_columns}")
        print("Updated DataFrame:")
        print(self.df.head())
        return self.df
    
    def remove_duplicates(self):
        # Remove duplicate entries based on the 'post' column
        initial_shape = self.df.shape
        self.df.drop_duplicates(subset='Post', inplace=True)
        final_shape = self.df.shape
        print(f"Removed {initial_shape[0] - final_shape[0]} duplicate rows.")
        print(f"New shape of the DataFrame: {final_shape}")
        return self.df

# Instantiating our class
data_info = DataInfo("kplc_df.csv")

# Get basic information about the dataset
data_info.get_basic_info()

# Remove irrelevant columns
irrelevant_columns = ['Author', 'Likes', 'Reposts', 'Comments', 'Post Link', 'Profile Links', 'Views', 'Post Link', 'Profile Link', 'Post Body', 'Retweets', 'Tweet URL']
cleaned_df = data_info.remove_irrelevant_columns(irrelevant_columns)

# Remove duplicates in the 'post' column
cleaned_df_no_duplicates = data_info.remove_duplicates()


Basic Information:
Shape of the DataFrame: (4408, 16)
Data types of each column:
Author           object
Handle           object
Media URL        object
Reposts         float64
Likes           float64
Comments        float64
Views            object
Post Link        object
Profile Link     object
Post             object
Date             object
Name             object
Retweets        float64
Tweet URL        object
Post Body        object
Timestamp        object
dtype: object
Missing values per feature:
Author           335
Handle             0
Media URL       4384
Reposts         4150
Likes           3726
Comments        1942
Views           3206
Post Link        326
Profile Link       0
Post             326
Date               0
Name            4082
Retweets        4397
Tweet URL       4082
Post Body       4082
Timestamp       4082
dtype: int64
Removed columns: ['Author', 'Likes', 'Reposts', 'Comments', 'Post Link', 'Views', 'Post Link', 'Profile Link', 'Post Body', 'Retweets', 'Tweet U

Our kplc_df dataset has 9 features and some of the features have missing values like likes, Reposts and comments. We have opted for deletion of these columns since they are irrelevant

We also had different data Types such us floats and Objects. We then checked for duplicates and removed all the duplicated text on post columns

Let us again see how our data looks like and whether there is any missing values

In [9]:
cleaned_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3426 entries, 0 to 4407
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Handle     3426 non-null   object
 1   Media URL  12 non-null     object
 2   Post       3425 non-null   object
 3   Date       3426 non-null   object
 4   Name       1 non-null      object
 5   Timestamp  1 non-null      object
dtypes: object(6)
memory usage: 187.4+ KB


Amazing there are no null values

We can now proceed to the next stage where we are going to remove any emojis present in our dataset, remove punctuation, lowercase all the posts, remove any noise such as mentions and hashtags, filter-out non-english words, lemmitize and then tokenize our data

In [10]:
class TextCleaner:
    def __init__(self, df, text_column):
        """
        Initialize the TextCleaner class with a DataFrame and the text column to clean.
        
        :param df: DataFrame containing the data
        :param text_column: The name of the column to clean
        """
        self.df = df
        self.text_column = text_column
        self.lemmatizer = WordNetLemmatizer()
        self.stop_words = set(stopwords.words('english'))
    
    def handle_emojis(self, text):
        """Replace emojis with corresponding text descriptions."""
        return emoji.demojize(text, delimiters=(" ", " "))
    
    def to_lowercase(self, text):
        """Convert text to lowercase."""
        return text.lower()
    
    def remove_punctuation_numbers(self, text):
        """Remove punctuation and numbers from the text."""
        return re.sub(r'[^a-zA-Z\s]', '', text)
    
    def remove_mentions_hashtags(self, text):
        """Remove mentions (@) and hashtags (#) from the text."""
        return re.sub(r'[@#]\w+', '', text)
    
    def tokenize(self, text):
        """Tokenize the text into words."""
        return word_tokenize(text)
    
    def lemmatize(self, tokens):
        """Lemmatize the tokens."""
        return [self.lemmatizer.lemmatize(token) for token in tokens]
    
    def filter_non_english(self, text):
        """Filter out text that is not in English."""
        try:
            return text if detect(text) == 'en' else ''
        except:
            return ''
    
    def clean_text(self):
        """Apply all cleaning steps to the specified text column in the DataFrame."""
        self.df[self.text_column] = self.df[self.text_column].apply(lambda text: self.clean_single_text(text))
        return self.df
    
    def clean_single_text(self, text):
        """Clean a single piece of text by applying all steps."""
        if pd.isna(text):
            return ''
        text = self.handle_emojis(text)
        text = self.to_lowercase(text)
        text = self.remove_punctuation_numbers(text)
        text = self.remove_mentions_hashtags(text)
        text = self.filter_non_english(text)
        if text:
            tokens = self.tokenize(text)
            lemmatized_tokens = self.lemmatize(tokens)
            return ' '.join(lemmatized_tokens)
        return ''


# Create an instance of TextCleaner for the 'post' column
text_cleaner = TextCleaner(cleaned_df, text_column='Post')

# Clean the text
cleaned_df = text_cleaner.clean_text()

# Optionally, save the cleaned DataFrame to a new CSV file
cleaned_df.to_csv('final_cleaned_kplc_df.csv', index=False)

# Return the cleaned DataFrame
print(cleaned_df.head())  # Display the first few rows to verify

             Handle Media URL  \
0  @Momanyi10908868       NaN   
1   @FrancisKimunya       NaN   
2        @LinaCheps       NaN   
3     @CShihembetsa       NaN   
4     @CShihembetsa       NaN   

                                                Post    Date Name Timestamp  
0                  we are already in a blackout a pm  29-Jul  NaN       NaN  
1  kenyapowercare we are in blackout the whole of...  29-Jul  NaN       NaN  
2          we are in a blackout again please resolve  29-Jul  NaN       NaN  
3  kenyapowercare kenyapower kindly check out pow...  28-Jul  NaN       NaN  
4  kenyapowercare kindly check out power line wev...  28-Jul  NaN       NaN  
