In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("gla02.ipynb")

<img src="./ccsf.png" alt="CCSF Logo" width=200px style="margin:0px -5px">

# Guided Learning Activity 02: Programming Fundamentals and NLP

This Guided Learning Activity is designed for you to complete alongside a Data Ambassador from the course. You might find that it feels like a combination of the lectures and lab assignment. Whether you are participating live or watching the recording of the live meeting, let the Data Ambassador guide you through the following tasks. There will be moments for you to reflect and explore your own ideas as a way to solidify concepts and skills introduced by your instructor. Keep in mind that this is not a graded assignment for MATH 108 by default. If you have any concerns about participation, reach out to your instructor.

---

## Learning Objectives

1. Confirm a variable's data type.
2. Access the attributes and methods of an object.
3. Compare and contrast lists and arrays.
4. Apply string methods and NumPy functions for text manipulation and cleaning.
5. Perform text preprocessing tasks, including lowercasing and tokenization, using NLTK.
6. Understand and apply basic sentiment analysis using NLTK.

---

## Object-Oriented Programming

Python is an object-oriented programming language, meaning it organizes code around objects. In Python, **everything is an object**, including instances of the built-in data types you are learning about like `int`, `float`, `str`, etc. Objects bundle **attributes** (which store data) and **methods** (which define behaviors). Run the following code cell to create the variables `an_int` and `a_string`, which reference an `int` object and a `str` object, respectively.

In [None]:
an_int = 10
a_string = "data science"

---

### Task 01 💭

<!-- BEGIN QUESTION -->

How can you check the data type of a variable in Python?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

---

### Attributes

An object's attributes are like variables that hold information about the object. For example, as a fraction, the integer $10$ can be expressed as $\frac{10}{1}$ where the numerator is $10$ and denominator is $1$. Python developers felt it was helpful to have that information readily available for every instance of an `int` object. So, `an_int` has attributes `numerator` and `denominator`. Run the following code cells to see this information.

In [None]:
an_int.numerator

In [None]:
an_int.denominator

Notice that commands to access attribute information follow the pattern `<object reference>.<attribute name>`.
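
The same dot pattern works for other built-in objects. As a small sketch (the variable names `an_example_int` and `a_complex` are made up for this illustration), the snippet below reads attributes from an `int` and from a complex number:

```python
# Hypothetical variables used only for this illustration.
an_example_int = 10
a_complex = 3 + 4j

# Attributes are read with a dot and no parentheses.
print(an_example_int.numerator)    # 10
print(an_example_int.denominator)  # 1
print(a_complex.real)              # 3.0
print(a_complex.imag)              # 4.0
```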

---

### Methods

An object's methods are functions that operate on the object. For example, the string `"data science"` associated with `a_string` is presented in lower case, and Python developers agreed it would be convenient to have a built-in function that makes every letter in a string upper case. So, `a_string` has a method called `upper`. Run the following command to see the results.

In [None]:
a_string.upper()

In this case, the command to access a method follows the pattern `<object reference>.<method name>()`. The parentheses are needed to perform the action! If you leave them off, then you are just reminded that `upper` is a function. Run the following code cell to see this in action.

In [None]:
a_string.upper

---

### Attributes and Methods

Hopefully, this helps you better understand dot notation in Python. When you perform an assignment, the variable you assign a value to is a reference to an object. This object _can_ have associated attributes and methods based on its data type. The word _can_ in that last statement is important to notice since it is possible for an object to not have any attributes or methods associated with it. For example, string (`str`) objects only have methods associated with them. How can you know that? You can look at the [Python documentation]() or you can leverage a shortcut in Jupyter Notebooks. If you have created a variable, like `a_string`, then when you type `a_string.` and press `<tab>`, you gain access to all the associated methods (labeled as functions) and attributes (labeled as instances). Try it! You should see a menu that looks like this:

<img src="./string_menu.png" width="200rem" alt="The first part of the menu showcasing string attributes and methods.">

In [None]:
# Type a_string. followed by pressing <tab>


---

## Customer Reviews

You've likely read several product/service reviews in your lifetime, and there is also a good chance that you've left a few reviews yourself. Those reviews might look like:

> "Great product!! Love it... 100% recommended!! "

or

> "0/5 STARS! Worst experience of my life."

There can be a lot of valuable information found in these reviews, but extracting that information has been challenging. The review with the negative tone might be connected to a person with a valid negative critique of the service, but it could also have been left by someone whose negativity was brought on by some other factor.

### Natural Language Processing

Natural Language Processing (NLP) is a field of computer science and artificial intelligence that focuses on enabling computers to understand, interpret, and generate human language. It combines linguistics with machine learning to analyze text and speech, allowing applications like chatbots, translation services, and sentiment analysis. So, if someone wanted a structured way to analyze and extract the information from reviews, then they could utilize NLP concepts and techniques.

### Data Cleaning

With NLP and many other data-related subjects, data cleaning is a crucial step in extracting meaningful information from data and drawing reliable conclusions. So, with text-based reviews, you need to be able to collect the data, extract the text from the data, clean the data, and so on. At some point, perhaps after forming some conclusion, you'll need to collect more data and continue this general cycle.

```mermaid
flowchart LR
    A[Collect Data] --> B[Extract Text] --> C[Clean Text] --> D[...] --> E[Make Conclusion] --> A
```

So, what does cleaning text-based data involve? A lot! The language and context of the text are very important. For example, reading "A+" in a review should indicate to most of us that the review was positive, but that association may not be as prevalent in other cultures. NLP specialists work with computers to identify and utilize these cultural patterns in text. You'll sometimes hear this part of Data Science referred to as _domain knowledge_. You can have all the computing power you need, but you still need an understanding of the context the data comes from.

#### Task 02 💻

Using string methods on `first_review`, transform `"LOVE IT! Great sandwich ❤️"` into all lower case characters and assign the results to `first_review_lower`.

In [None]:
first_review = "LOVE IT! Great sandwich ❤️"
first_review_lower = ...
print(first_review_lower)

In [None]:
grader.check("task_02")

---

Why should you convert the text to all lower case? Well, one reason is that maybe someone else has left a review like "Great Sandwich!". In this review and the previous one, the phrase "great sandwich" is there, but the capitalization is different. Standardizing the presentation of those phrases would allow Python to count that there are 2 instances of the same phrase. If you left the capitalization as it was originally, then Python would not consider these two phrases to be the same.
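
To see why the capitalization matters, here is a minimal sketch using two made-up phrases (`phrase_a` and `phrase_b` are hypothetical names, not variables from this notebook). It standardizes the case with `upper`, the method introduced earlier, only to show that any consistent case makes the phrases match:

```python
# Hypothetical phrases, similar to the reviews above.
phrase_a = "Great sandwich"
phrase_b = "Great Sandwich"

# Python compares strings character by character, so case matters.
same_as_written = phrase_a == phrase_b

# Standardizing the case first makes the two phrases identical.
same_after_standardizing = phrase_a.upper() == phrase_b.upper()

print(same_as_written)           # False
print(same_after_standardizing)  # True
```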

---

#### Task 03 💻

Using string methods on `second_review` and `third_review`, transform `"Great Sandwich!"` and `"This sandwich was terrible ..."` into all lower case characters and assign the results to `second_review_lower` and `third_review_lower`, respectively.

In [None]:
second_review = "Great Sandwich!"
third_review = "This sandwich was terrible ..."
second_review_lower = ...
third_review_lower = ...
print(second_review_lower)
print(third_review_lower)

In [None]:
grader.check("task_03")

---

### NumPy

Could you imagine cleaning up hundreds, thousands, or millions of reviews like that?! When you are performing the same task on many different things with a computer, then there is likely a way to get the computer to apply the generic process to everything for you. In this case, you can leverage sequenced data types to apply some function to many objects in one command.

[NumPy](https://numpy.org) is a package of Python code that extends Python's built-in functionality. One of the main data types in this package is the `numpy.ndarray`. We'll likely use `array` as a shorter name.

In order to access the tools from NumPy, you need to import the package `numpy`. It is pretty standard to abbreviate `numpy` as `np`. Run the following command to do that.

In [None]:
import numpy as np

After importing `numpy` as `np`, when you type `np.` and press `<tab>` in a Jupyter notebook, you'll preview all the included functions, data types, etc.

In [None]:
# Type np. followed by pressing <tab>


There is an overwhelming number of tools available in the NumPy package. In MATH 108, you'll focus on using only a few of the functions to keep things more manageable.

---

#### Task 04 💭

<!-- BEGIN QUESTION -->

Python has a built-in data type called `list`. What is the major difference between lists and arrays?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

---

You create lists in Python using bracket notation `[item1, item2, item3, ...]`. To create an array, you can use the function `np.array`. For the input to that function, provide a list. So, `np.array(["Hello", "Hi", "Hey"])` would create an array of 3 strings. 

The MATH 108 materials indicate that you use the `make_array` function from the `datascience` library to create arrays. This is essentially doing the same thing as using `np.array`. You'll find that using `np.array` is a more standard approach outside of the course.
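
As a rough sketch of how a list and an array built from the same items compare, the snippet below uses made-up names (`greetings_list`, `greetings_array`, `mixed_list`, `mixed_array`) and only `np.array`. It is meant as an illustration of one observable difference, not a complete comparison:

```python
import numpy as np

# Hypothetical greetings, matching the example in the text.
greetings_list = ["Hello", "Hi", "Hey"]
greetings_array = np.array(greetings_list)

# Both hold the same three items in the same order.
print(len(greetings_list), len(greetings_array))  # 3 3

# A list keeps each item's own type; an array stores one shared type.
mixed_list = [1, 2.5, 3]
mixed_array = np.array(mixed_list)
print(mixed_list)         # [1, 2.5, 3]
print(mixed_array.dtype)  # float64
```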

Run the following code cell to import everything from the `datascience` library if you choose to use `make_array`.

In [None]:
from datascience import *

#### Task 05 💻

Create a NumPy array called `array_of_reviews` that contains the reviews `first_review`, `second_review`, and `third_review`, in that order. The resulting array should display as:

```
array(['LOVE IT! Great sandwich ❤️', 'Great Sandwich!',
       'This sandwich was terrible ...'], dtype='<U30')
 ```

In [None]:
array_of_reviews = ...
array_of_reviews

In [None]:
grader.check("task_05")

---

NumPy provides a function called `np.char.lower`, which takes an array (or list) as input and returns a new array with all the items converted to lowercase.  

Notice the dots in `np.char.lower`! This structure reflects how NumPy is organized: **NumPy** (`np`) is a package containing multiple **modules**, and `char` is one of them. The `lower` function resides within the `char` module, which specifically handles string operations in NumPy.  

Run the following code cell to view the documentation for this function.

In [None]:
np.char.lower?

The line ```Call :meth:`str.lower` element-wise.``` indicates that this function applies Python's `lower` string method to each string item in the array. Essentially, `np.char.lower` works by calling `str.lower()` on every string in the input array.
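
To make "element-wise" concrete, the sketch below compares one NumPy call with calling the string method on each item by hand. The array `shouts` and the other names are hypothetical and exist only for this illustration:

```python
import numpy as np

# Hypothetical array used only for this illustration.
shouts = np.array(["HELLO", "Hi", "hey"])

# NumPy applies the string method to every item in one call.
with_numpy = np.char.lower(shouts)

# Roughly the same result, built by calling the method on each string by hand.
one_at_a_time = np.array(["HELLO".lower(), "Hi".lower(), "hey".lower()])

print(with_numpy)                                 # ['hello' 'hi' 'hey']
print(np.array_equal(with_numpy, one_at_a_time))  # True
```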

---

#### Task 06 💻

Apply `np.char.lower` to `array_of_reviews` to convert all three reviews to lowercase, and assign the new array to `array_of_reviews_lower`.

In [None]:
array_of_reviews_lower = ...
array_of_reviews_lower

In [None]:
grader.check("task_06")

---

#### Basic Analysis

Now that all the reviews have been cleaned to the point where all the characters are lowercase, we can do some basic analysis. The NumPy function `np.strings.count` (available in NumPy 2.0 and later) looks for a pattern in each item of an array and returns the number of times that pattern appears in each item. So, the following command should show an output that is an array with three 1's because `"sandwich"` appears once in each of the 3 reviews.

In [None]:
np.strings.count(array_of_reviews_lower,  "sandwich")

Without using the lowercase version of the reviews, notice with the following code cell that the counter would have missed one!

In [None]:
np.strings.count(array_of_reviews,  "sandwich")

Using the information from the cleaned (lowercase) text, we could identify all the reviews that specifically address the sandwich and pass those reviews on to the next phase of the analysis.
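
One possible way to do that selection is sketched below with made-up reviews (`sample_reviews` is a hypothetical array, not part of this activity). It uses `np.char.count`, which counts substrings the same way and also exists in NumPy versions older than 2.0, and then selects items with an array of `True`/`False` values:

```python
import numpy as np

# Hypothetical lowercase reviews used only for this illustration.
sample_reviews = np.array(
    ["great sandwich", "slow service", "the sandwich was cold"]
)

# Count how many times "sandwich" appears in each review.
sandwich_counts = np.char.count(sample_reviews, "sandwich")
print(sandwich_counts)  # [1 0 1]

# Comparing the counts to 0 gives True/False for each review, and that
# array of booleans can select the matching reviews.
mentions_sandwich = sandwich_counts > 0
sandwich_reviews = sample_reviews[mentions_sandwich]
print(sandwich_reviews)  # ['great sandwich' 'the sandwich was cold']
```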

---

#### Task 07 💻

Using NumPy, create an array called `great_count_per_review_arr` that shows the count of how many times `great` appears in each review of the array `array_of_reviews_lower`.

In [None]:
great_count_per_review_arr = ...
great_count_per_review_arr

In [None]:
grader.check("task_07")

---

## Tokenizing

Tokenizing is the process of splitting text into smaller units, called tokens. These tokens can be words, phrases, or even sentences, depending on the level of tokenization. In NLP, tokenizing is a fundamental step that helps break down text into meaningful components for further analysis.

For example, tokenizing this sentence:

"I love this product! It's amazing."

Could produce the list of tokens:

```python
['I', 'love', 'this', 'product', '!', 'It', "'s", 'amazing', '.']
```

In terms of analyzing product/service reviews, tokenizing can help us
* Break reviews into individual words or phrases, which makes it easier to analyze sentiment, frequency, and patterns.
* Filter out punctuation, stopwords (like "the", "is"), or unwanted characters.
* Identify positive/negative words (e.g. "great", "bad") and determine customer sentiment.
* Count how often certain words appear in reviews.
* Prepare for more advanced operations.
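
To give a feel for what a tokenizer does, here is a deliberately simplified sketch built on Python's `re` module. The splitting rule is an assumption made for illustration and is not the algorithm NLTK uses, so its output differs slightly from the token list shown above (the apostrophe becomes its own token):

```python
import re
from collections import Counter

sentence = "I love this product! It's amazing."

# Simplified rule (an assumption, not NLTK's algorithm): a token is either
# a run of letters/digits or a single punctuation character.
tokens = re.findall(r"\w+|[^\w\s]", sentence)
print(tokens)
# ['I', 'love', 'this', 'product', '!', 'It', "'", 's', 'amazing', '.']

# Counting tokens is one of the analyses that tokenizing makes easy.
token_counts = Counter(tokens)
print(token_counts["love"])  # 1
```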

---

### NLTK

With most specialized fields, there are a few standard Python packages/libraries to support the work done in that field. For NLP, [`nltk`](https://www.nltk.org) (Natural Language Toolkit) is such a library. Run the following code cell to import `nltk`, import the `word_tokenize` function from the `tokenize` module, and download the tokenizer data `punkt_tab`.

In [None]:
import nltk
from nltk.tokenize import word_tokenize
try:
    nltk.download('punkt_tab', quiet=True)
except Exception:  # avoid a bare except, which would also catch KeyboardInterrupt
    print('Something went wrong with downloading the tokenizer data.')

From the `tokenize` module, there is a function called `word_tokenize` that creates a list of tokens from a string that is provided as input. Run the following command to tokenize the string `"Some things are natural about NLP and some things are not!"`. You should get the following list as an output:

```
['Some',
 'things',
 'are',
 'natural',
 'about',
 'NLP',
 'and',
 'some',
 'things',
 'are',
 'not',
 '!']
```

In [None]:
word_tokenize("Some things are natural about NLP and some things are not!")

Notice that `word_tokenize` considered the punctuation as a separate token.

#### Task 08 💻

Using `word_tokenize`, create an **array** called `first_review_tokens_arr` that contains the tokens for the first review in `array_of_reviews_lower`.

In [None]:
first_review_tokens_arr = ...
first_review_tokens_arr 

In [None]:
grader.check("task_08")

---

From this point, you could tokenize every review and perform a variety of standard cleaning tasks such as:
* removing punctuation
* removing stop words such as "the," "is," "and"
* reducing words to their base/root form through a process called stemming (e.g. the stem of "running" is "run")

How you clean the reviews depends on what you want to do next.
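
As an illustration of the first two cleaning tasks, the sketch below filters a made-up token list. The tokens and the tiny `stop_words` set are hypothetical, and the code uses a list comprehension, a compact kind of loop that is covered later in the course:

```python
import string

# Hypothetical tokens and a tiny stop word set, for illustration only.
tokens = ["the", "sandwich", "is", "great", "!", "and",
          "the", "sauce", "is", "savory", "."]
stop_words = {"the", "is", "and"}

# string.punctuation holds common punctuation characters such as ! and .
punctuation_marks = set(string.punctuation)

# Keep a token only if it is neither single-character punctuation
# nor a stop word.
cleaned_tokens = [
    token for token in tokens
    if token not in punctuation_marks and token not in stop_words
]
print(cleaned_tokens)  # ['sandwich', 'great', 'sauce', 'savory']
```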

---

## Sentiment Analysis

What does the word "sentiment" mean? In general, sentiment refers to a feeling, attitude, or opinion that is influenced by emotion. For a simple example, you could consider how positive, negative, or neutral a review is. It is likely that a review is not 100\% positive, so it is probably a blend of these three sentiment labels. One common NLP task is to perform sentiment analysis on some text.

Run the following command to import the `SentimentIntensityAnalyzer` from NLTK, download the sentiment data, and assign `sia` to the sentiment intensity analyzer.

In [None]:
from nltk.sentiment import SentimentIntensityAnalyzer
try:
    nltk.download('vader_lexicon', quiet=True)
except Exception:  # avoid a bare except, which would also catch KeyboardInterrupt
    print('Something went wrong with downloading the vader lexicon data.')
sia = SentimentIntensityAnalyzer()

You'll see in a moment that the sentiment scores for `'love it! great sandwich ❤️'` are:
```
{'neg': 0.0, 'neu': 0.259, 'pos': 0.741, 'compound': 0.8622}
```

Overall, the sentiment of the statement is generally positive with some of the sentiment being labeled as neutral. 

The downloaded `'vader_lexicon'` data is part of what helps determine how positive, negative, or neutral a statement is judged to be. [VADER](https://www.nltk.org/api/nltk.sentiment.vader.html#module-nltk.sentiment.vader) applies rules to refine sentiment scores. These rules take into consideration amplifying words like "very", negations like "not", punctuation and capitalization like "GOOD!!!", and slang words and emojis such as 😡.
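
One way to read a dictionary of scores like the one above is sketched below. The helper `label_from_compound` is hypothetical, and its cutoff of 0.05 is a commonly used convention for VADER compound scores that is treated here as an assumption, not something defined in this activity:

```python
# Scores copied from the example output above.
scores = {'neg': 0.0, 'neu': 0.259, 'pos': 0.741, 'compound': 0.8622}

# The neg, neu, and pos values are proportions, so they add up to about 1.
total = scores['neg'] + scores['neu'] + scores['pos']
print(round(total, 3))  # 1.0


def label_from_compound(compound, threshold=0.05):
    """Turn a compound score into a label using an assumed cutoff."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


print(label_from_compound(scores['compound']))  # positive
```

The `compound` value summarizes the other three into a single number between -1 and 1, which is why a single cutoff can be applied to it.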

### Task 09 💭

<!-- BEGIN QUESTION -->

Run the following commands to see the sentiment analysis for the first review in its original form and lowercase form.

In [None]:
print(f'Original Review: {first_review}')
sia.polarity_scores(first_review)

In [None]:
print(f'Lowercase Review: {first_review_lower}')
sia.polarity_scores(first_review_lower)

Why do you think the original version was more positive?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

---

### Task 10 💭

<!-- BEGIN QUESTION -->

Run the following code cell to see the sentiment scores for all 3 reviews in their original format plus a few more. Keep in mind that this code uses a `for` loop construction that you'll learn about later in the course. 

In [None]:
array_of_reviews_extended = np.append(
    array_of_reviews, [
        "Great sandwich! The sauce was so savory 🤤",
        "It's a sandwich 😑",
        "Loved it ... 🙄",
        "This sandwich was terrible."
    ]
)
for review in array_of_reviews_extended:
    sentiment = sia.polarity_scores(review)
    print(f'The sentiment scores for: "{review}" are:\n{sentiment}\n')

How did the sentiment analysis turn out? Do you agree with the results for the most part? Did it identify the negative reviews? Is there any cleaning that you think should be done to better capture the sentiment of the reviews?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

---

## Reflection

In this guided learning activity, you developed your understanding of object-oriented programming in Python, particularly the concepts of attributes and methods. Working with strings and exploring their various methods, like `.lower()`, was instrumental in understanding how to manipulate and clean text data.  The introduction to NumPy arrays and their application in batch processing, along with the use of NLTK for tokenization and sentiment analysis, provided a practical glimpse into the world of NLP.

---

## License

This content is licensed under the <a href="https://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0)</a>.

<img src="./by-nc-sa.png" width=100px>