# Scraping Data from the Web
Here we will use Beautiful Soup to scrape review data from Yelp. This barely scratches the surface of what web scraping with Python can do; there are several other libraries and approaches.

Run the notebook cells in order.

In [1]:
# Import our libraries
import requests
from bs4 import BeautifulSoup
import re

#### Open a browser and go to https://www.yelp.com/



#### Bring up the page for any business you like and copy its link

In [2]:
# Set the Yelp Link Here as a string
LINK = 'https://www.yelp.com/biz/bakersfield-columbus'

In [3]:
# Here we are using Python Requests to "GET" the page
r = requests.get(LINK)
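
A request can fail or be refused, so it can help to check the response before parsing it. The sketch below is one way to do that. The helper name `fetch_html`, the 10-second timeout, and the User-Agent string are assumptions made for illustration and are not part of the original notebook.

```python
import requests

def fetch_html(link, timeout=10):
    """Return the page HTML as text, or None if the request fails (hypothetical helper)."""
    # Assumed header: some sites respond differently to requests without a User-Agent
    headers = {"User-Agent": "Mozilla/5.0 (scraping tutorial)"}
    try:
        response = requests.get(link, headers=headers, timeout=timeout)
        # Raise an exception for 4xx/5xx status codes instead of silently parsing an error page
        response.raise_for_status()
    except requests.exceptions.RequestException as error:
        print(f"Request failed: {error}")
        return None
    return response.text
```

In the notebook this could be used as `html_code = fetch_html(LINK)`, followed by a check that the result is not `None` before handing it to Beautiful Soup.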

In [None]:
# Get the text from the response. In this case it's actually HTML code
html_code = r.text

# Print it just to see what it looks like
print(html_code)

In [7]:
# Initialize BeautifulSoup with our html_code and use 'html.parser' to parse the HTML
soup = BeautifulSoup(html_code, 'html.parser')

In [8]:
# Create a regex that matches any class name containing 'comment'; the review paragraphs on the page use such classes
regex = re.compile('.*comment.*')

# Use soup's find_all method to find every <p> tag whose class matches the regex
# find_all returns every match in a list
results = soup.find_all('p', {'class':regex})

# If curious you can print the results
print(results)

[<p class="comment__09f24__D0cxf css-qgunke"><span class="raw__09f24__T4Ezm" lang="en">Great spot in the short north for a good casual bite!<br/>Great food, great atmosphere, even better service!<br/>Our server Leyla was phenomenal! Thank you and see you next time!</span></p>, <p class="comment__09f24__ZU8MN css-qgunke"><span class="raw__09f24__T4Ezm">Thanks for taking time to share, Blake! It's wonderful to know that our team, especially Leyla, made your experience memorable. We appreciate your support and look forward to welcoming you back soon!</span> </p>, <p class="comment__09f24__D0cxf css-qgunke"><span class="raw__09f24__T4Ezm" lang="en">Wow-this was delicious. From start to finish, everything was great.<br/><br/>Our server, Nathan, knew the menu very well and paced everything perfectly. Assisted in finding stuff for one of us with a serious allergy with ease. Our drinks, a seasonal cranberry margarita (picture included) was pretty good and looked beautiful, but this pineapple m
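
To see how the class regex selects tags, here is a small self-contained sketch on made-up HTML. The class names and text in `sample_html` are invented for illustration; only the `find_all` call mirrors the cell above.

```python
import re
from bs4 import BeautifulSoup

# Made-up HTML that imitates the structure seen in the output above
sample_html = """
<div>
  <p class="comment__abc123 css-1">Great food!</p>
  <p class="footer-note css-1">Not a review</p>
  <p class="comment__xyz789 css-1">Slow service.</p>
</div>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# A tag matches when any one of its classes matches the regex
sample_results = sample_soup.find_all('p', class_=re.compile('comment'))

sample_texts = [tag.text for tag in sample_results]
print(sample_texts)
```

Only the two paragraphs with a `comment...` class are returned; the `footer-note` paragraph is skipped.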

In [9]:
# Using the text method we will extract only the text of each result in the list and save into a new list called reviews
reviews = [result.text for result in results]

# Number of reviews
print(len(reviews))
# Print each review on a new line. Just makes it easier to read.
for review in reviews:
    print(review)

20
Great spot in the short north for a good casual bite!Great food, great atmosphere, even better service!Our server Leyla was phenomenal! Thank you and see you next time!
Thanks for taking time to share, Blake! It's wonderful to know that our team, especially Leyla, made your experience memorable. We appreciate your support and look forward to welcoming you back soon! 
Wow-this was delicious. From start to finish, everything was great.Our server, Nathan, knew the menu very well and paced everything perfectly. Assisted in finding stuff for one of us with a serious allergy with ease. Our drinks, a seasonal cranberry margarita (picture included) was pretty good and looked beautiful, but this pineapple margarita thing we ordered was just phenomenal. I'd recommend both. Queso was not my favorite but it was still tasty. Guac was good and so were the two other sauces that came with it. Sorry I do not know the names of them, but that green one was amazing, tasted so fresh.There were so many o
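
Notice that sentences run together in the printed reviews (for example `bite!Great food`). This happens because `.text` drops the `<br/>` tags without inserting anything in their place. A minimal sketch of one way around it, using Beautiful Soup's `get_text` with a separator on a made-up snippet:

```python
from bs4 import BeautifulSoup

# Made-up snippet with a <br/> between two sentences, like the Yelp markup above
snippet = '<p><span>Great spot!<br/>Great food!</span></p>'
tag = BeautifulSoup(snippet, 'html.parser').p

# .text concatenates the pieces with nothing in between
joined = tag.text

# get_text can put a separator between the pieces and strip extra whitespace
spaced = tag.get_text(separator=' ', strip=True)

print(joined)
print(spaced)
```

In the notebook, the same idea would be `reviews = [result.get_text(separator=' ', strip=True) for result in results]`.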

# Getting started with Pandas Dataframes
Pandas DataFrames serve as a fundamental tool for data scientists, analysts, and developers, providing a versatile and efficient way to work with structured data, perform data operations, and prepare data for further analysis or machine learning tasks.

DataFrames give us a simple way to create, label, and clean our data before training. **The point here is to get a taste for what it is like to create data to train a model.**

### Our Model
We are preparing data for a sentiment analysis model: the model takes in some text (a review) and tries to identify whether the text is positive or negative. The steps below show how to label the data we captured with either a positive or negative sentiment.

#### Check out this quick primer on Pandas https://www.youtube.com/shorts/H0JriItTqn8

In [10]:
# import pandas library
import pandas as pd

In [11]:
# Let's load each of our reviews from above into a pandas DataFrame for analysis.

# Create a DataFrame with a single 'review' column using the DataFrame class
df = pd.DataFrame(reviews, columns=['review'])

# Display the DataFrame. NOTE: ending the cell with df instead of calling print() displays it in a nicer format.
df

| | review |
|---|---|
| 0 | Great spot in the short north for a good casua... |
| 1 | Thanks for taking time to share, Blake! It's w... |
| 2 | Wow-this was delicious. From start to finish, ... |
| 3 | Thank you for your positive review, Alannah! W... |
| 4 | Fine choice for a quick bite. Their marketing ... |
| 5 | We appreciate your feedback, Celina. While we ... |
| 6 | I don't think I've ever gone to Bakersfield an... |
| 7 | Thank you for your review, Diana! We appreciat... |
| 8 | Drinks are great! Haven't tried the food yet ... |
| 9 | Thank you for your positive review, Kelly! We'... |
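
Several rows in the table above read like replies from the business ("Thanks for...", "Thank you for...", "We appreciate...") rather than customer reviews, because both were captured by the same class regex. If only customer reviews are wanted for training, one rough option is to filter on those opening phrases. This is a heuristic sketch on made-up rows; the pattern is an assumption based on the few replies visible above and would need checking against real data.

```python
import pandas as pd

# Made-up rows imitating the mix of reviews and business replies above
sample_df = pd.DataFrame({'review': [
    'Great spot for a casual bite!',
    'Thanks for taking time to share, Blake!',
    'Fine choice for a quick bite.',
    'We appreciate your feedback, Celina.',
]})

# Assumed opening phrases of business replies; str.match anchors at the start of each string
reply_pattern = r'(?:Thanks for|Thank you for|We appreciate)'
is_reply = sample_df['review'].str.match(reply_pattern)

# Keep only the rows that do not look like replies and renumber the index
customer_reviews = sample_df[~is_reply].reset_index(drop=True)
print(customer_reviews)
```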


In [12]:
# Add an empty column for sentiment to the DataFrame. This column is where we will add our labels.
df['sentiment'] = "" 

# See it
df

| | review | sentiment |
|---|---|---|
| 0 | Great spot in the short north for a good casua... | |
| 1 | Thanks for taking time to share, Blake! It's w... | |
| 2 | Wow-this was delicious. From start to finish, ... | |
| 3 | Thank you for your positive review, Alannah! W... | |
| 4 | Fine choice for a quick bite. Their marketing ... | |
| 5 | We appreciate your feedback, Celina. While we ... | |
| 6 | I don't think I've ever gone to Bakersfield an... | |
| 7 | Thank you for your review, Diana! We appreciat... | |
| 8 | Drinks are great! Haven't tried the food yet ... | |
| 9 | Thank you for your positive review, Kelly! We'... | |


In [13]:
# Show the text of the first review. Here we filter the first row with .iloc and ONLY the 'review' column.
df['review'].iloc[0]

'Great spot in the short north for a good casual bite!Great food, great atmosphere, even better service!Our server Leyla was phenomenal! Thank you and see you next time!'

In [14]:
# After reading the text, is it positive or negative?

# We will consider this review as positive.
# Use .loc to set the value in a single step; chained assignment (df['sentiment'].iloc[0] = ...) stops updating the original DataFrame under Copy-on-Write in pandas 3.0
df.loc[0, 'sentiment'] = 'positive'

# Display the first row
df.iloc[0]

# Or uncomment to display the entire DF
#df

review       Great spot in the short north for a good casua...
sentiment                                             positive
Name: 0, dtype: object
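
As a small illustration of single-step assignment, the sketch below labels two made-up rows with `.loc` and `.at`. Both are label-based; with the default 0, 1, 2, ... index used here, the labels coincide with the row positions that `.iloc` uses.

```python
import pandas as pd

# Made-up reviews with an empty sentiment column
demo_df = pd.DataFrame({'review': ['Loved it', 'Too salty'], 'sentiment': ['', '']})

# .loc takes the row label and the column name in one indexing step
demo_df.loc[0, 'sentiment'] = 'positive'

# .at does the same for a single cell
demo_df.at[1, 'sentiment'] = 'negative'

print(demo_df)
```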

In [15]:
# Repeat the same step for the rest of the rows. Change the row variable, then run this cell and the next one.
row = 1

df['review'].iloc[row]

"Thanks for taking time to share, Blake! It's wonderful to know that our team, especially Leyla, made your experience memorable. We appreciate your support and look forward to welcoming you back soon! "

In [16]:
# Mark the row as 'positive' or 'negative'
df.loc[row, 'sentiment'] = ''  # 'positive' or 'negative'

In [None]:
# Display the results when done
df

In [None]:
# If you would like to randomly generate the sentiment instead, uncomment all of the code below
# import random

# rand_sentiment = ['positive', 'negative']

# for row in range(len(df)):
#     df.loc[row, 'sentiment'] = random.choice(rand_sentiment)

# df

In [None]:
# To save our DF so we can pick up where we left off, we can dump it to a file such as a CSV or Parquet file.

# Let's save our DF as a CSV.
df.to_csv('yelp_sentiment.csv')

# You will see that a new file was created in your current directory.

In [None]:
# When you are ready to pick back up where you left off, you can load the file into a DF using the read_csv() function.
import pandas as pd

# Calling this new_df so as not to confuse it with the current df. index_col=0 reads the saved index back instead of adding an extra column.
new_df = pd.read_csv('yelp_sentiment.csv', index_col=0)

# Display it and make sure the DF matches where you left off above.
new_df
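
To see why `index_col=0` matters, the sketch below round-trips a made-up DataFrame through CSV text in memory (using `io.StringIO` so that no file is written) and compares the columns with and without it.

```python
import io
import pandas as pd

# Made-up labeled data
demo_df = pd.DataFrame({'review': ['Loved it', 'Too salty'],
                        'sentiment': ['positive', 'negative']})

# With no path, to_csv() returns the CSV as a string; the index is written as an unnamed first column
csv_with_index = demo_df.to_csv()

# Without index_col, the saved index comes back as an extra column
naive = pd.read_csv(io.StringIO(csv_with_index))

# With index_col=0, the first column is used as the index again
restored = pd.read_csv(io.StringIO(csv_with_index), index_col=0)

print(list(naive.columns))
print(list(restored.columns))
```

An alternative is to skip the index when saving, with `to_csv(..., index=False)`, and then read the file back without `index_col`.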

# Importing Data from Kaggle
To train our model we will use a Yelp reviews dataset from Kaggle.

Go to [Kaggle](https://www.kaggle.com/datasets/zhenyufan/yelp-reveiws?resource=download) and download the Yelp dataset there so you can become familiar with Kaggle.

**If you cannot download and extract it, a copy is provided as 'yelp.csv'.**

In [17]:
# Let's begin by taking a look at the dataset and becoming familiar with it.
import pandas as pd

# Read in the csv to a DF 
df = pd.read_csv('yelp.csv')

# Get information on the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   business_id  10000 non-null  object
 1   date         10000 non-null  object
 2   review_id    10000 non-null  object
 3   stars        10000 non-null  int64 
 4   text         10000 non-null  object
 5   type         10000 non-null  object
 6   user_id      10000 non-null  object
 7   cool         10000 non-null  int64 
 8   useful       10000 non-null  int64 
 9   funny        10000 non-null  int64 
dtypes: int64(4), object(6)
memory usage: 781.4+ KB


## Dataset for our Model
You will see that the dataset consists of 10K entries with 10 columns. The columns of interest for training our sentiment model are:
- stars - A 1-5 rating of the experience, 1 being low and 5 being high.
- text - The actual review text. This is what our model will classify as either positive or negative.

To train our sentiment analysis classifier model, we could treat any rating of 3 or higher as positive and anything 2 or lower as negative.
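
The threshold rule can be checked on a tiny made-up table covering every possible star rating. This sketch uses `numpy.where`, which is one of several ways to apply the rule to a whole column at once.

```python
import numpy as np
import pandas as pd

# One made-up row per possible star rating
toy = pd.DataFrame({'stars': [1, 2, 3, 4, 5]})

# 3 stars or higher -> 'positive', 2 or lower -> 'negative'
toy['sentiment'] = np.where(toy['stars'] >= 3, 'positive', 'negative')

print(toy)
```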

In [18]:
# Just to clean things up a bit, let's delete the columns we don't need: business_id, date, review_id, type, user_id, cool, useful and funny.
df = df.drop(columns=['business_id', 'date', 'review_id', 'type', 'user_id', 'cool', 'useful', 'funny'])

In [19]:
# Let's create another column called 'sentiment'.
df['sentiment'] = ""

In [20]:
# Now let's label the data. For any row where the stars are 3 or higher, we consider the sentiment to be positive. 2 or under is negative.
# Applying the rule to the whole 'stars' column at once avoids chained assignment, which stops working under Copy-on-Write in pandas 3.0.
df['sentiment'] = df['stars'].apply(lambda stars: 'positive' if stars >= 3 else 'negative')

# display it
df



| | stars | text | sentiment |
|---|---|---|---|
| 0 | 5 | My wife took me here on my birthday for breakf... | positive |
| 1 | 5 | I have no idea why some people give bad review... | positive |
| 2 | 4 | love the gyro plate. Rice is so good and I als... | positive |
| 3 | 5 | Rosie, Dakota, and I LOVE Chaparral Dog Park!!... | positive |
| 4 | 5 | General Manager Scott Petello is a good egg!!!... | positive |
| ... | ... | ... | ... |
| 9995 | 3 | First visit...Had lunch here today - used my G... | positive |
| 9996 | 4 | Should be called house of deliciousness!\n\nI ... | positive |
| 9997 | 4 | I recently visited Olive and Ivy for business ... | positive |
| 9998 | 2 | My nephew just moved to Scottsdale recently so... | negative |
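
Most of the rows visible above are labeled positive, so it may be worth checking how balanced the two classes are before training. A minimal sketch on made-up labels, using `value_counts`:

```python
import pandas as pd

# Made-up labels; in the notebook this would be df['sentiment']
toy = pd.DataFrame({'sentiment': ['positive', 'positive', 'positive', 'negative']})

# Number of rows per label
counts = toy['sentiment'].value_counts()

# Share of rows per label (fractions that sum to 1)
shares = toy['sentiment'].value_counts(normalize=True)

print(counts)
print(shares)
```

If one label dominates heavily, a model can look accurate just by predicting that label every time, which is something to keep in mind when evaluating it later.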


In [21]:
# Like we did above, save our training data to a csv file for training a model later
df.to_csv('training_data.csv')