# Scraping Data from the Web
Here we will use Beautiful Soup to scrape data from Yelp. This barely scratches the surface of what web scraping with Python can do; there are several other libraries and approaches.

Run the Notebook in order.

In [None]:
# Import our libraries
import requests
from bs4 import BeautifulSoup
import re

#### Open a browser and go to https://www.yelp.com/



#### Bring up a page for any business you would like and get the link

In [None]:
# Set the Yelp Link Here as a string
LINK = 'https://www.yelp.com/biz/bakersfield-columbus'

In [None]:
# Here we are using Python Requests to "GET" the page
r = requests.get(LINK)
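
A plain `requests.get` call does not complain if the site answers with an error status or a block page instead of the business page. The sketch below shows one way to fetch more defensively. The helper name `fetch_html`, its `get` parameter (there only so the function can be exercised without a network), and the `User-Agent` value are all assumptions made for illustration; `timeout` and `raise_for_status()` are standard Requests features.

```python
import requests

# Hypothetical helper, not part of the original notebook.
def fetch_html(url, get=requests.get, timeout=10):
    """Return the page HTML, raising an error on a 4xx/5xx response."""
    # Illustrative header value; some sites may treat the default
    # Requests User-Agent differently from a browser-like one.
    headers = {'User-Agent': 'Mozilla/5.0 (scraping tutorial)'}
    response = get(url, headers=headers, timeout=timeout)
    # Raises requests.HTTPError if the status code signals a failure
    response.raise_for_status()
    return response.text
```

With this in place, `html_code = fetch_html(LINK)` could stand in for the two cells that call `requests.get` and read `r.text`.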

In [None]:
# Get the text from the response. In this case it's actually HTML code
html_code = r.text

# Print it just to see what it looks like
print(html_code)

In [None]:
# Initialize BeautifulSoup with our html_code and use 'html.parser' so we can parse the HTML code
soup = BeautifulSoup(html_code, 'html.parser')

In [None]:
# Create a regex to search the HTML code, in this case for class names containing "comment"
regex = re.compile('.*comment.*')

# Use the soup object's find_all method to find all <p> tags whose class matches our regex
# find_all returns every match in a list
results = soup.find_all('p', {'class':regex})

# If curious you can print the results
print(results)
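
To see what the class regex matches without depending on Yelp's live markup, here is a small sketch run against made-up HTML. The sample markup and the `sample_*` names are assumptions for illustration; real pages use different class names, and those names can change at any time.

```python
import re
from bs4 import BeautifulSoup

# Made-up HTML standing in for a real page
sample_html = """
<div>
  <p class="intro">Welcome to our page</p>
  <p class="comment__abc123">Great tacos!</p>
  <p class="user-comment bold">Slow service.</p>
  <span class="comment">Not a p tag</span>
</div>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Match any class name that contains "comment"
sample_regex = re.compile('.*comment.*')
sample_results = sample_soup.find_all('p', {'class': sample_regex})

# Only the two <p> tags with a matching class are returned
sample_texts = [tag.get_text() for tag in sample_results]
print(sample_texts)  # ['Great tacos!', 'Slow service.']
```

The `<span>` is skipped because the search is limited to `p` tags, and the `intro` paragraph is skipped because its class does not contain "comment".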

In [None]:
# Using the text attribute, extract only the text of each result and save it into a new list called reviews
reviews = [result.text for result in results]

# Number of reviews
print(len(reviews))
# Print each review on a new line. Just makes it easier to read.
for review in reviews:
    print(review)
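
Scraped text often carries stray whitespace or empty entries. The sketch below shows one possible cleanup pass before the reviews go any further. The helper name `clean_reviews` and the sample strings are assumptions for illustration, not part of the original notebook.

```python
# Hypothetical helper, not part of the original notebook.
def clean_reviews(raw_reviews):
    """Collapse runs of whitespace and drop entries that end up empty."""
    cleaned = []
    for text in raw_reviews:
        # split() with no arguments splits on any whitespace,
        # so joining with a single space normalizes the text
        normalized = ' '.join(text.split())
        if normalized:
            cleaned.append(normalized)
    return cleaned


# Made-up sample input
sample_reviews = ['  Great   tacos!\n', '', 'Slow\xa0service ', '   ']
print(clean_reviews(sample_reviews))  # ['Great tacos!', 'Slow service']
```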

# Getting Started with Pandas DataFrames
Pandas DataFrames serve as a fundamental tool for data scientists, analysts, and developers, providing a versatile and efficient way to work with structured data, perform data operations, and prepare data for further analysis or machine learning tasks.

They give us a simple way to create, label, and clean our data before training. **The point here is to get a taste for what it is like to create data to train a model.**

### Our Model
We are preparing data for a sentiment analysis model: the model takes in some text (a review) and tries to identify whether the text is positive or negative. The steps below show how to label the data we captured with either a positive or a negative sentiment.

#### Check out this quick primer on Pandas https://www.youtube.com/shorts/H0JriItTqn8

In [None]:
# import pandas library
import pandas as pd

In [None]:
# Let's load each one of our reviews from above into a pandas DataFrame for analysis.

# Create a DataFrame using the pandas DataFrame class
df = pd.DataFrame(reviews, columns=['review'])

# Display the DataFrame. NOTE: leaving off print() displays it in a nicer format.
df

In [None]:
# Add an empty column for sentiment to the DataFrame. This column is where we will add our analysis.
df['sentiment'] = ""

# See it
df

In [None]:
# Show the text of the first review. Here we filter the first row with .iloc and ONLY the 'review' column.
df['review'].iloc[0]

In [None]:
# After reading the text, is it positive or negative?

# We will consider this review as positive.
# Use .loc[row, column] so the value is written directly into df.
df.loc[0, 'sentiment'] = 'positive'

# Display the first row
df.iloc[0]

# Or uncomment to display the entire DF
#df
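
Labels can be written with a single `.loc[row, column]` call. This avoids the chained-assignment pattern (`df['sentiment'].iloc[0] = ...`), which pandas may warn about or, with copy-on-write enabled, silently ignore. A minimal sketch on made-up data, where `demo_df` and its reviews are assumptions for illustration:

```python
import pandas as pd

# Made-up reviews for illustration
demo_df = pd.DataFrame({'review': ['Great tacos!', 'Slow service.'],
                        'sentiment': ['', '']})

# One indexing call per label: row label first, then column name
demo_df.loc[0, 'sentiment'] = 'positive'
demo_df.loc[1, 'sentiment'] = 'negative'

print(demo_df)
```

Because the DataFrame uses the default 0, 1, 2, ... index, the row label passed to `.loc` is the same number as the row position.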

In [None]:
# Repeat the same step for the rest of the rows. Change the row variable first, then run this cell and the next one in order.
row = 1

df['review'].iloc[row]

In [None]:
# Mark the review as positive or negative
df.loc[row, 'sentiment'] = ''  # 'positive' or 'negative'

In [None]:
# Display the results when done
df

In [None]:
# If you would like to randomly generate the sentiment, you can uncomment all the code in this cell
# import random

# rand_sentiment = ['positive', 'negative']

# for row in range(len(df)):
#     df.loc[row, 'sentiment'] = random.choice(rand_sentiment)

# df

In [None]:
# To save our DF so we can pick up where we left off, we can dump it to a file such as a CSV or Parquet file.

# Let's save our DF as a CSV.
df.to_csv('yelp_sentiment.csv')

# You will see that a new file was created in your current directory.

In [None]:
# When you are ready to pick back up where you left off you can load a file into a DF using the read_csv() method.
import pandas as pd

# Calling this new_df so as not to confuse it with the current df
new_df = pd.read_csv('yelp_sentiment.csv', index_col=0)

# Print it and make sure the DF matches where you left off above.
new_df
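
One detail to watch when checking that the reloaded DataFrame matches: by default `read_csv` reads empty fields as `NaN`, so any rows still labeled with an empty string come back as missing values. The sketch below illustrates this on made-up data, using an in-memory buffer in place of a file on disk; the `roundtrip_df` name and its contents are assumptions for illustration.

```python
import io
import pandas as pd

# Made-up data: the second row has not been labeled yet
roundtrip_df = pd.DataFrame({'review': ['Great tacos!', 'Slow service.'],
                             'sentiment': ['positive', '']})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
roundtrip_df.to_csv(buffer)

# Default behavior: the empty label is read back as NaN
buffer.seek(0)
default_df = pd.read_csv(buffer, index_col=0)
print(default_df['sentiment'].isna().tolist())  # [False, True]

# keep_default_na=False keeps the empty label as an empty string
buffer.seek(0)
exact_df = pd.read_csv(buffer, index_col=0, keep_default_na=False)
print(exact_df['sentiment'].tolist())  # ['positive', '']
```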

# Importing Data from Kaggle
To train our model, we will use a dataset from Kaggle.

Go to [Kaggle](https://www.kaggle.com/datasets/zhenyufan/yelp-reveiws?resource=download) and download the Yelp Dataset there so you can become familiar with Kaggle.

**If you cannot download and extract it, a copy is at 'yelp.csv'.**

In [None]:
# Let's begin by taking a look at the dataset and becoming familiar with it.
import pandas as pd

# Read in the csv to a DF 
df = pd.read_csv('yelp.csv')

# Get information on the data
df.info()

## Dataset for our Model
You will see that the dataset consists of 10K entries with 10 columns. The columns of interest for training our sentiment model are:
- stars - The 1-5 rating of the experience, 1 being low and 5 being high.
- text - The actual review text. This is what our model will classify as either positive or negative.

To train our sentiment analysis classifier, we can treat any rating of 3 or higher as positive and anything 2 or lower as negative.

In [None]:
# Just to clean things up a bit, let's delete the columns we don't need: business_id, date, review_id, type, user_id, cool, useful and funny.
df = df.drop(columns=['business_id', 'date', 'review_id', 'type', 'user_id', 'cool', 'useful', 'funny'])

In [None]:
# Let's create another column called 'sentiment'.
df['sentiment'] = ""

In [None]:
# Now let's do the labeling. For any row where the stars are 3 or higher, we consider the sentiment to be positive. 2 or under is negative.
for row in range(len(df)):
    # If greater than or equal to 3 it's positive, else it's negative.
    # .loc[row, column] writes the label directly into df.
    if df.loc[row, 'stars'] >= 3:
        df.loc[row, 'sentiment'] = 'positive'
    else:
        df.loc[row, 'sentiment'] = 'negative'

# display it
df
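
The same labeling rule can also be applied to the whole column at once, which is usually faster than a row-by-row loop on 10K rows. Below is a minimal sketch on made-up ratings; `ratings_df` and its values are assumptions for illustration. It also counts the labels, since it is worth checking how balanced the two classes are before training.

```python
import pandas as pd

# Made-up ratings for illustration
ratings_df = pd.DataFrame({'stars': [5, 3, 2, 1, 4],
                           'text': ['a', 'b', 'c', 'd', 'e']})

# Compare the whole column at once, then map True/False to labels
is_positive = ratings_df['stars'] >= 3
ratings_df['sentiment'] = is_positive.map({True: 'positive', False: 'negative'})

# Count how many rows fall into each class
print(ratings_df['sentiment'].value_counts())
```

On this sample, three rows are labeled positive and two negative.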

In [None]:
# Like we did above, save our training data to a csv file for training a model later
df.to_csv('training_data.csv')