# 1.0. Business Understanding

## 1.1. Overview 
The world today is rife with information flowing from millions of users across different platforms, on topics as varied as politics, celebrities, data science, Wordle, and exercise. As these opinions gain traction they draw more and more traffic and reach a much larger audience, who may in turn share the same information with their own networks.

Natural Language Processing (NLP) is a machine learning technology that gives computers the ability to interpret, manipulate, and comprehend human language. This makes it well suited to analysing the opinions and sentiments that people post across the internet.

Every business, with or without an online presence, needs some way of observing, recording, tracking and analysing online opinions about its products or services. Doing so protects its public image and ensures that opinions on the web do not leave a bad taste in the mouths of its users and, especially, its potential users.

A major mobile vendor who has been collecting sentiments across brands, products and services reached out to us at **SentimentFlow** to address the business problem above.

SentimentFlow leverages cutting-edge NLP techniques to analyze sentiment in textual data, providing valuable insights for decision-making by the management of the vendor. The analysis would be used to determine whether a piece of text is positive, negative or neutral.

## 1.2. Problem Statement
With such a large volume of information shared by and received from users and potential users, businesses cannot keep up if they attempt to track everything, everywhere, all at once, manually.

If a business does not fully comprehend the effects of public opinion, its public image could be tarnished. A poor public image could lead to loss of market share, loss of trust from its repeat customers, low credibility with its potential clients, and loss of investment or partnership opportunities.


## 1.3. Proposed Solution
Analysing public opinion would help businesses monitor their brand and the sentiment around their products and services as it arrives in customer feedback, and understand customer needs. This awareness would help them prevent poor public relations.

## 1.4. Objectives
**Main Objective**
> To create an NLP multiclass classification model that classifies sentiment into one of 3 categories: Positive, Negative or Neutral. The model should achieve a recall score of 85% and an accuracy of 90%.
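
The recall and accuracy targets in the main objective can be checked with a few lines of plain Python. The sketch below is illustrative only: the objective does not say how recall is averaged across the three classes, so macro-averaging is an assumption here, and the function names and sample labels are made up for the example.

```python
LABELS = ["Positive", "Negative", "Neutral"]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def macro_recall(y_true, y_pred, labels=LABELS):
    """Unweighted mean of per-class recall (assumed averaging scheme)."""
    recalls = []
    for label in labels:
        support = sum(t == label for t in y_true)
        if support == 0:
            continue  # a class absent from y_true has no defined recall
        hits = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        recalls.append(hits / support)
    return sum(recalls) / len(recalls)

def meets_targets(y_true, y_pred, min_recall=0.85, min_accuracy=0.90):
    """True only when both targets from the main objective are reached."""
    return (macro_recall(y_true, y_pred) >= min_recall
            and accuracy(y_true, y_pred) >= min_accuracy)

# Made-up labels, purely for illustration.
y_true = ["Positive", "Positive", "Negative", "Neutral", "Neutral"]
y_pred = ["Positive", "Negative", "Negative", "Neutral", "Neutral"]

print(f"accuracy={accuracy(y_true, y_pred):.2f}")
print(f"macro recall={macro_recall(y_true, y_pred):.2f}")
print(f"meets targets: {meets_targets(y_true, y_pred)}")
```

Macro-averaging weights each class equally, which matters if the sentiment classes turn out to be imbalanced; a different averaging choice would give a different recall figure.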

**Specific Objectives**
> - To identify the most common words used in the dataset using a word cloud.
> - To confirm the most used words that are positively and negatively tagged.
> - To recognize the products that users have expressed opinions on.
> - To spot the distribution of the sentiments.
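
The word-frequency and sentiment-distribution objectives come down to counting. A minimal sketch using only the standard library follows; the tokenising regex, the function names and the sample tweets are assumptions made for illustration, not the notebook's actual preprocessing.

```python
import re
from collections import Counter

def top_words(texts, n=3):
    """Count lowercase word tokens across texts and return the n most common."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts.most_common(n)

def label_distribution(labels):
    """Return each label's share of the total as a fraction between 0 and 1."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Made-up tweets and labels, purely for illustration.
tweets = ["ipad launch at sxsw", "love the ipad", "sxsw queue for ipad"]
labels = ["Positive", "Positive", "Negative", "Neutral"]

print(top_words(tweets, n=2))
print(label_distribution(labels))
```

Frequencies like these are what a word cloud visualises, with more frequent words drawn larger.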

## 1.5. Constraints

The following potential constraints were identified:
- Data Quality - Incomplete data, imbalanced classes and missing data could affect the overall performance of the models.
- Interpretability and Explainability - Interpreting clinical terms could be a challenge to data scientists for medical decision-making.
- Feature Selection - It's challenging to identify relevant features due to lack of domain knowledge.
- Data Privacy - Potentially private but informativ

# 2.0. Data Understanding

## 2.1. Sources
This data was sourced from [Data World](https://data.world/crowdflower/brands-and-product-emotions). 

The data itself is sufficient for the project to run smoothly. However, it would have been better if the dataset contained a timestamp for each record.

## 2.2. The Data

### 2.2.1. Libraries

In [12]:
# Data Manipulation
import pandas as pd
import numpy as np

### 2.2.2. Class Creation

In [67]:
class DataUnderstanding():
    """Class that gives the data understanding of a dataset"""
    def __init__(self, df=None):
        if df is not None:
            self.df = df
        
    def load_data(self, path):
        self.df = pd.read_csv(path, encoding='latin-1')
        return self.df
    
    def understanding(self):
        # Info
        print("""INFO""")
        print("-"*4)
        self.df.info()
        
        # Shape
        print("""\n\nSHAPE""")
        print("-"*5)
        print(f"Records in dataset are {self.df.shape[0]} with {self.df.shape[1]} columns.")
        
        # Columns
        print("\n\nCOLUMNS")
        print("-"*6)
        print(f"Columns in the dataset are:")
        for idx in self.df.columns:
            print(f"- {idx}")
        
        # Unique Values
        print("\n\nUNIQUE VALUES")
        print("-"*12)
        for col in self.df.columns:
            print(f"Column *{col}* has {self.df[col].nunique()} unique values")
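            # 9065 is the number of unique tweets in this dataset; skip that column so every tweet is not listed
            if self.df[col].nunique() != 9065: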
                print(f"Top unique values in the *{col}* include:")
                for idx in self.df[col].value_counts().index:
                    print(f"- {idx}")
            print("")
            
        # Missing or Null Values
        print("""\nMISSING VALUES""")
        print("-"*15)
        for col in self.df.columns:
            print(f"Column *{col}* has {self.df[col].isnull().sum()} missing values.")
            
        # Duplicate Values
        print("""\n\nDUPLICATE VALUES""")
        print("-"*16)
        print(f"The dataset has {self.df.duplicated().sum()} duplicated records.")
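
The `understanding` method decides which columns to list values for by comparing against the hard-coded count 9065, which only fits this dataset. One possible alternative, sketched here under assumptions, is a ratio test that flags free-text columns by how many of their values are distinct. The helper name `is_high_cardinality` and the 0.5 threshold are hypothetical and are not part of the original class.

```python
import pandas as pd

def is_high_cardinality(series: pd.Series, max_ratio: float = 0.5) -> bool:
    """Return True when more than max_ratio of the non-null values are distinct."""
    non_null = series.dropna()
    if non_null.empty:
        return False
    return non_null.nunique() / len(non_null) > max_ratio

# Made-up frame, purely for illustration.
demo = pd.DataFrame({
    "tweet_text": ["a", "b", "c", "d"],          # every value distinct
    "brand": ["iPad", "iPad", "Apple", "iPad"],  # two distinct values in four rows
})

for col in demo.columns:
    print(col, is_high_cardinality(demo[col]))
```

With a check like this, the value listing could be skipped for any column that behaves like free text, without tying the code to one dataset's row counts.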

### 2.2.3. Data Investigation

In [68]:
data = DataUnderstanding()
df = data.load_data(path="judge_tweet_product_company.csv")
df.head()

| | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product |
|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive emotion |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive emotion |
| 3 | @sxsw I hope this year's festival isn't as cra... | iPad or iPhone App | Negative emotion |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Google | Positive emotion |


In [69]:
data.understanding()

INFO
----
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9093 entries, 0 to 9092
Data columns (total 3 columns):
 #   Column                                              Non-Null Count  Dtype 
---  ------                                              --------------  ----- 
 0   tweet_text                                          9092 non-null   object
 1   emotion_in_tweet_is_directed_at                     3291 non-null   object
 2   is_there_an_emotion_directed_at_a_brand_or_product  9093 non-null   object
dtypes: object(3)
memory usage: 213.2+ KB


SHAPE
-----
Records in dataset are 9093 with 3 columns.


COLUMNS
------
Columns in the dataset are:
- tweet_text
- emotion_in_tweet_is_directed_at
- is_there_an_emotion_directed_at_a_brand_or_product


UNIQUE VALUES
------------
Column *tweet_text* has 9065 unique values

Column *emotion_in_tweet_is_directed_at* has 9 unique values
Top unique values in the *emotion_in_tweet_is_directed_at* include:
- iPad
- Apple
- iPad or iPhone App
-

##### Comments:
- All the columns are in the correct format
- The column names will need to be changed
- Missing values (NaN) in the features should be replaced with a descriptive label
- Duplicate records should be dropped
- All records with the target as "I can't tell" should be dropped
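
The cleaning steps listed in these comments could be sketched in pandas roughly as follows. This is only an illustration under stated assumptions: the short column names (`tweet`, `product`, `emotion`) and the `"Unknown"` fill label are hypothetical, since the comments do not say what the new names or the replacement for NaN should be, and the sample rows are made up.

```python
import pandas as pd

# Hypothetical short names; the comments only say the names need changing.
RENAMES = {
    "tweet_text": "tweet",
    "emotion_in_tweet_is_directed_at": "product",
    "is_there_an_emotion_directed_at_a_brand_or_product": "emotion",
}

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Rename columns, label missing products, drop duplicates and unclear targets."""
    out = df.rename(columns=RENAMES)
    # Replace NaN in the product column with an explicit label (assumed label).
    out["product"] = out["product"].fillna("Unknown")
    # Drop exact duplicate records.
    out = out.drop_duplicates()
    # Drop records whose target is "I can't tell".
    out = out[out["emotion"] != "I can't tell"]
    return out.reset_index(drop=True)

# Made-up rows, purely for illustration.
sample = pd.DataFrame({
    "tweet_text": ["love my iPad", "love my iPad", "not sure about this", "battery died again"],
    "emotion_in_tweet_is_directed_at": ["iPad", "iPad", None, None],
    "is_there_an_emotion_directed_at_a_brand_or_product": [
        "Positive emotion", "Positive emotion", "I can't tell", "Negative emotion",
    ],
})

cleaned = clean(sample)
print(cleaned)
```

Filling the missing products before dropping duplicates keeps the row comparison on explicit labels rather than NaN; the order of the last two steps does not affect the result.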