## Text analysis using WordClouds

#### What is Text analysis?

- Text analysis is the process of examining large collections of unstructured textual data, in order to generate new information. Text analysis is also known as "Text Mining".

- The bag-of-words model is a way of extracting features (words) from text: it describes how often each word occurs in a document, ignoring word order. A word cloud visualises these counts by drawing more frequent words in a larger font.
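
To make the idea of word occurrences concrete, here is a minimal sketch that counts words in two made-up reviews using only the standard library. The function name `bag_of_words`, the simple regular-expression tokeniser and the sample sentences are illustrative assumptions, not part of the lab dataset.

```python
from collections import Counter
import re


def bag_of_words(text):
    """Return a Counter mapping each lowercase word to its number of occurrences."""
    # Assumption: a word is a run of letters or apostrophes; real tokenisers are more careful
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


# Hypothetical sample reviews, not taken from the lab dataset
docs = ["The drink was great, really great value", "The flavour was too sweet"]
bags = [bag_of_words(d) for d in docs]

# Word order is discarded; only the counts per document remain
print(bags[0].most_common(3))
print(bags[1]["sweet"])
```

A word that appears more often in a document gets a higher count, which is the quantity a word cloud later turns into font size.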

#### The Goal
The goal of this lab is to produce a word cloud to analyse customer feedback survey data.

#### About the dataset
For this lab we will use the "ConsumerSentiment.xlsx" dataset. 

#### Download and Install Python Libraries

In [None]:
#!pip install pandas
#!pip install numpy
#!pip install scikit-learn
#!pip install scipy
#!pip install seaborn
#!pip install matplotlib
#!pip install nltk
#!pip install wordcloud
#!pip install openpyxl  # needed by pandas.read_excel for .xlsx files

#### Import Python Libraries

In [None]:
# Start with loading all necessary libraries
import numpy as np
import pandas as pd

from wordcloud import WordCloud
from wordcloud import STOPWORDS

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')





#### Process map
This lab follows the 5-step process below.

    1. Import Data
    2. Data Quality Checks
    3. Data Cleansing
    4. Data Pre-processing
    5. Visualisations



#### 1. Import Data

In [None]:
# Read data from an Excel file and save it into a dataframe called "df"

df = pd.read_excel("ConsumerSentiment.xlsx")
df

#### 2. Data Quality Checks

    2.1 Check data
    2.2 Check shape of data
    2.3 Check for duplicates
    2.4 Check for missing values

In [None]:
# 2.1
# Viewing top 5 records

df.head()

In [None]:
# 2.2
# Looking at the structure of the dataframe

df.shape

In [None]:
# 2.3
# Use the duplicated() function to count how many duplicate records there are in the dataset

df.duplicated().sum()

In [None]:
# 2.4
# info() prints a summary of a dataframe: the index, column names and dtypes, non-null counts and memory usage
# Comparing the non-null counts with the number of rows shows which columns have missing values
# If any are found, they can be dropped or filled (interpolation only suits numeric columns, not free text)

df.info()
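
As a complement to `info()`, missing values can also be counted directly per column. The sketch below uses a small made-up dataframe; the name `toy` and its `Score` column are assumptions for illustration only, and only the `Text` column name mirrors the lab data.

```python
import pandas as pd

# Hypothetical toy data standing in for the survey file
toy = pd.DataFrame({
    "Text": ["Great taste", None, "Too sweet", None],
    "Score": [5, 3, None, 1],
})

# isna() flags each missing cell; sum() adds the flags up per column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

Running the same two calls on `df` would give the per-column counts for the lab dataset.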

#### 3. Data Cleansing

In [None]:
# Remove all duplicate records from the dataset using the drop_duplicates() function
# (uncomment the line below if step 2.3 reported any duplicates)

# df = df.drop_duplicates()
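
The effect of `drop_duplicates()` can be seen on a small made-up dataframe. This is a sketch under the assumption that a duplicate means a fully identical row; `toy` and its contents are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: the first and last rows are identical
toy = pd.DataFrame({"Text": ["Great taste", "Too sweet", "Great taste"]})

# duplicated() marks every repeat after the first occurrence
n_dupes = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence; reset_index tidies the row labels
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_dupes, len(deduped))
```

Removing duplicates matters for a word cloud because repeated reviews would otherwise inflate the counts of the words they contain.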

#### 4. Data Pre-processing

    4.1 Create the stopwords list
    4.2 Create a corpus

In [None]:
# 4.1
# Create the stopwords list object called "sw"
from wordcloud import STOPWORDS

sw = set(STOPWORDS)
sw

In [None]:
# Adding custom words to the stopwords list "sw"
sw.update(["drink", "now", "flavour"])
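
To illustrate what a stopword list does, the sketch below filters one made-up review against a small hand-written set. The set `stop` is a hypothetical stand-in for `STOPWORDS`, and the whitespace tokenisation is a simplification of what the wordcloud library does internally.

```python
# Hypothetical mini stopword set; the lab itself uses wordcloud's STOPWORDS
stop = {"the", "is", "a", "and"}

# Custom words are added the same way as in the lab
stop.update(["drink", "now", "flavour"])

# Made-up review text
review = "The drink is sweet and the flavour is a bit artificial"

# Keep only the words that are not in the stopword set
kept = [w for w in review.lower().split() if w not in stop]
print(kept)
```

Only the words that carry sentiment or detail remain, which is why domain words such as "drink" are worth adding to the list when every review mentions them.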

In [None]:
# 4.2
# combining all customer reviews to create a corpus
# this corpus will be used during the wordcloud generation process


# Convert the data in the "Text" column to strings
# (note: any missing values become the literal string "nan")
df["Text"] = df["Text"].astype(str)


# Join all reviews from the "Text" column into one big text corpus called "tc"
tc = " ".join(df["Text"])


# Count the total number of words in the corpus
# (len(tc) alone would count characters, so split on whitespace first)
print("There are {} words in the corpus".format(len(tc.split())))
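
Note that `len()` applied to a string returns the number of characters, so a word count needs the text split into tokens first. A minimal sketch on a hypothetical sample string, assuming words are separated by whitespace:

```python
# Made-up sample text standing in for the corpus "tc"
sample = "Great taste. Would buy again!"

# len() on a string counts characters, including spaces and punctuation
n_chars = len(sample)

# split() with no arguments breaks the text on any run of whitespace
n_words = len(sample.split())

print("characters:", n_chars)
print("words:", n_words)
```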

In [None]:
# Explore the text corpus "tc" --> let's look at the first 1000 characters
tc[0:1000]

#### 5. Visualisations

In [None]:
# Create the wordcloud object called "wd"
from wordcloud import WordCloud

wd = WordCloud(stopwords=sw,
               max_font_size=30, 
               max_words=100,
               random_state=45,
               background_color="white").generate(tc)




In [None]:
# Display the wordcloud using matplotlib

plt.figure(figsize=(12,10))
plt.imshow(wd)
plt.axis("off")
plt.show()

In [None]:
# Save the wordcloud as an image to the default working directory
wd.to_file("CustomerSentimentWordCloud.png")

In [None]:
# To find out the default working directory
# import os
# os.getcwd()
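
As an alternative to `os.getcwd()`, the full path of the saved image can be built with `pathlib`. This is a sketch that assumes the file name used in the previous cell and that the image was saved with a relative path; the variable names are illustrative.

```python
from pathlib import Path

# File name used when saving the word cloud in the previous cell
out_name = "CustomerSentimentWordCloud.png"

# A relative file name is resolved against the current working directory
out_path = Path.cwd() / out_name

print(out_path)           # absolute path where the image is expected
print(out_path.exists())  # True only once the image has actually been saved
```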