# Ernesto.Net
Data analysts and data scientists have to spend a lot of time preparing a dataset for any data task, because the data we get often contains errors and is sometimes not labeled. A dataset needs labels before you can use it to solve many problems. One of those problems is sentiment analysis, where the data arrives as reviews or comments from users and you need to add labels to prepare it for analysis. So, if you want to learn how to label unlabeled data, this code is for you. Professor Lee will present a tutorial on how to add labels to a dataset for sentiment analysis using Python.

In [None]:
!pip install nltk
!pip install openpyxl

In [None]:
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download("vader_lexicon")
import pandas as pd
# data = pd.read_csv("https://raw.githubusercontent.com/amankharwal/Website-data/master/reviews%20data.csv")
# data = data.dropna()
# print(data.head())

In [None]:
excel_file = pd.ExcelFile('Texas Last Statement - Excel.xlsx')
# View the excel_file's sheet names
print(excel_file.sheet_names)
  
# Load the excel_file's Sheet1 as a dataframe
df = excel_file.parse('Sheet1')
# df = df.dropna()

print(df.T)

This dataset contains many columns, so I will now move to the task of adding labels to it. I will start by adding four new columns, Positive, Negative, Neutral, and Compound, by calculating the sentiment scores of the column containing the text (the Last Statement column):

In [None]:
sentiments = SentimentIntensityAnalyzer()
# Score each statement once instead of four times. Missing statements are
# treated as empty text (an assumption), which VADER scores as all zeros.
scores = [sentiments.polarity_scores(str(text)) for text in df["Last Statement"].fillna("")]
df["Positive"] = [s["pos"] for s in scores]
df["Negative"] = [s["neg"] for s in scores]
df["Neutral"] = [s["neu"] for s in scores]
df["Compound"] = [s["compound"] for s in scores]
df.head()

As you can see in the above output, we have added four new columns containing the sentiment scores of the Last Statement column. Now the next task is to add labels by categorizing these scores. According to the thresholds commonly used with VADER, if the compound score is 0.05 or higher, the text is categorized as Positive; if the compound score is -0.05 or lower, it is categorized as Negative; otherwise, it’s Neutral. So with this information, I will add a new column to this dataset which will include all the sentiment labels:

In [None]:
score = df["Compound"].values
sentiment = []
for i in score:
    # Thresholds at +/-0.05 split the compound score into three classes
    if i >= 0.05:
        sentiment.append('Positive')
    elif i <= -0.05:
        sentiment.append('Negative')
    else:
        sentiment.append('Neutral')
df["Sentiment"] = sentiment
df.head()
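
The threshold rule can also be written as a small reusable function, which makes the boundary cases easy to check. This is only a minimal sketch: `label_from_compound` and its `threshold` parameter are illustrative names and are not part of NLTK.

```python
def label_from_compound(score, threshold=0.05):
    """Map a compound score to a sentiment label (hypothetical helper)."""
    if score >= threshold:
        return "Positive"
    if score <= -threshold:
        return "Negative"
    return "Neutral"


# Scores exactly at +/-0.05 fall into the Positive/Negative classes,
# matching the comparison operators used in the loop above
examples = [0.8, 0.05, 0.0, -0.05, -0.6]
labels = [label_from_compound(s) for s in examples]
print(labels)
```

With such a helper defined, the label column could presumably be built in one line with `df["Sentiment"] = df["Compound"].apply(label_from_compound)`.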

Now let’s have a look at the frequencies of all the labels:

In [None]:
print(df["Sentiment"].value_counts())
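
Raw counts can be hard to compare when the classes are unbalanced, so it may help to look at proportions as well. The sketch below uses a small made-up `Sentiment` column as a stand-in for the real data; passing `normalize=True` to `value_counts` returns each label's share of the rows instead of its count.

```python
import pandas as pd

# Made-up labels standing in for df["Sentiment"] (illustration only)
demo = pd.DataFrame({"Sentiment": ["Positive", "Positive", "Negative", "Neutral"]})

# normalize=True returns fractions that sum to 1 instead of raw counts
shares = demo["Sentiment"].value_counts(normalize=True)
print(shares.round(2))
```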

So now we have ended up with a labeled dataset that has several additional columns. The "Last Statement" column was the only text column we scored; we added four columns containing its sentiment scores and, finally, a new column containing labels derived from those scores. If you only want the text and label columns, you can remove all other columns before saving your dataset. To save your new labeled data, you can execute the command below:

In [None]:
df.to_csv("new_data.csv")
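
If only the text and its label are needed, the two columns can be selected before saving. The following is a sketch on a tiny made-up frame rather than the real dataset; the file name `labeled_statements.csv` is an assumption, and `index=False` is passed so that the row index is not written as an extra column.

```python
import os
import tempfile

import pandas as pd

# Tiny made-up frame standing in for the labeled dataset (illustration only;
# the scores are invented, not real VADER output)
demo = pd.DataFrame({
    "Last Statement": ["Thank you all.", "I have nothing to say."],
    "Compound": [0.36, 0.0],
    "Sentiment": ["Positive", "Neutral"],
})

# Keep only the text and label columns
labeled = demo[["Last Statement", "Sentiment"]]

# Hypothetical output path; a temporary directory keeps this demo self-contained
out_path = os.path.join(tempfile.mkdtemp(), "labeled_statements.csv")
labeled.to_csv(out_path, index=False)

# Read the file back to confirm that only the two columns were saved
saved = pd.read_csv(out_path)
print(list(saved.columns))
```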

# Summary
So this is how you can add labels to an unlabeled dataset for sentiment analysis using the Python programming language. Adding labels to an unlabeled dataset is very important before we can use it for solving a problem.