
Scraping and Sanitizing Data Code Upload #206

Open
wants to merge 10 commits into base: #ScrapingAndSanitizingData
Conversation

AlexDoytch

Reference to issue

Description of the changes proposed in the pull request

  • Uploaded web scraper code for review

Reviewers requested:

AlexDoytch and others added 5 commits June 11, 2020 20:28
updating inline python code with backticks and fixing find_all() into findAll()
Contributor

@kylebegovich left a comment

Adding review comments. All in all, things look great: a couple of nits and one addition, then good to merge. Approving to unblock once changes are in.


Now that we have our data, we want to reformat it and add it to a final set. We pass the stats variable into a pandas DataFrame. Because of how the information was stored, our stats variable is a 4-element list in which each element holds 100 values. We want the opposite of that, so we call `.transpose()` on our new DataFrame to flip the rows to columns and vice versa, and set the column names to our list of features. We then call `.append()` to add it to the final DataFrame.
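A minimal sketch of that reshaping step, with made-up stat values standing in for the scraped data (note that `DataFrame.append()` was removed in pandas 2.0, so `pd.concat` is shown here as the equivalent call):

```python
import pandas as pd

# Hypothetical stand-ins: the real scraper fills `stats` with a
# 4-element list whose elements each hold 100 scraped values.
features = ["Player", "Team", "PTS", "AST"]
stats = [
    ["Michael Jordan", "Karl Malone"],
    ["CHI", "UTA"],
    [32.5, 27.0],
    [8.0, 4.5],
]

page_df = pd.DataFrame(stats).transpose()  # 4x2 becomes 2x4
page_df.columns = features

# `.append()` was removed in pandas 2.0; pd.concat is the equivalent.
# In the real loop there would be one entry per scraped page.
final_df = pd.concat([page_df], ignore_index=True)
print(final_df.shape)  # (2, 4)
```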

# Bonus: Cleaning The Data
Contributor

re: conversation in our meeting, worth showing a screenshot pre- vs. post-filtered to demonstrate the filtering in this section. Otherwise, this section looks pretty good


We were able to work with a cleaned data set that didn't require much to change. However, one may want to make more specific queries on the existing data set or eliminate any potentially null or missing values. For example, I can use the `.isin()` method to extract a value or list of values. So, if I only wanted the list of twenty-point-per-game scorers from the Suns, Bulls, Lakers, Celtics, and Knicks, I could extract them with `isin()`. For null values, imagine an older data set that started in 1964 instead of 1984. The three-point line didn't exist before 1979, so I can use `dropna()` to remove any seasons from before the three-point era. The snippet below showcases a basic use of `isin()` and `dropna()`.

![Alt Text](https://i.imgur.com/Mn052zQ.png)
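For readers who prefer text to a screenshot, here is a small self-contained sketch of the same filtering, with invented rows standing in for the scraped data:

```python
import numpy as np
import pandas as pd

# Invented rows standing in for the scraped season data.
df = pd.DataFrame({
    "Player": ["Devin Booker", "Zach LaVine", "LeBron James", "Jrue Holiday"],
    "Tm": ["PHO", "CHI", "LAL", "MIL"],
    "PTS": [26.6, 24.8, 25.0, 19.3],
    "3P%": [0.354, 0.419, 0.365, np.nan],  # NaN mimics a pre-1979 season
})

# Keep only 20+ PPG scorers on the listed teams.
teams = ["PHO", "CHI", "LAL", "BOS", "NYK"]
filtered = df[df["Tm"].isin(teams) & (df["PTS"] >= 20)]
print(filtered["Player"].tolist())  # ['Devin Booker', 'Zach LaVine', 'LeBron James']

# Drop rows with missing values (e.g. three-point stats before 1979).
complete = df.dropna()
print(len(complete))  # 3
```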
Contributor

include the updated code that prints or otherwise shows the output in this screenshot

# add an empty placeholder row, then drop every row containing NaN values
new_df = new_df.append(pd.Series(dtype='object'), ignore_index=True)
new_df = new_df.dropna()
final_df = final_df.dropna()

Contributor

add the section for displaying filtered data towards the end here


Understanding where our data lies is crucial for implementing the scraper. BeautifulSoup uses an HTML parser to locate the data we want, but we have to give it some baseline information to do so.

Before we start, we want to ensure we have the following libraries installed using `pip install <package_name>`: pandas, numpy, requests, and bs4. The first thing our scraper needs to determine is how many pages of information there are. This particular dataset is laid out across 9 different URLs, so we have to examine what changes between two pages. The link is quite long, but at the end we can see a part that reads offset=0 on the first page, offset=100 on the second, and increments by 100 each time.
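A sketch of how those offsets could drive the scraping loop; the URL below is a placeholder, since the real link is not reproduced in this excerpt:

```python
import requests
from bs4 import BeautifulSoup

# Placeholder: the real stats URL is much longer; only its trailing
# `offset` query parameter changes from page to page.
BASE_URL = "https://example.com/stats?offset={}"

def page_urls(pages=9, step=100):
    """Build one URL per result page: offset=0, 100, 200, ..."""
    return [BASE_URL.format(i * step) for i in range(pages)]

def fetch_page(url):
    """Download one page and hand its HTML to BeautifulSoup."""
    response = requests.get(url)
    return BeautifulSoup(response.text, "html.parser")

print(page_urls()[0])   # https://example.com/stats?offset=0
print(page_urls()[-1])  # https://example.com/stats?offset=800
```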
Contributor

consider an "installing dependencies" piece, either here or in the code with an inline comment about using pip3 (or another way?)
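One concrete form that suggestion could take (invoking pip through the interpreter sidesteps the pip-vs-pip3 ambiguity):

```shell
# Install the scraper's dependencies against the Python 3 interpreter
python3 -m pip install pandas numpy requests bs4
```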

@@ -0,0 +1,38 @@
import numpy as np
Contributor

ref comment above: "consider an "installing dependencies" piece, either here or in the code with an inline comment about using pip3 (or another way?)"

blog.md Outdated

# Bonus: Cleaning The Data

We were able to work with a cleaned data set that didn't require too much to change. However, one may want to make more specific queries on the existing data set or eliminate any potential null values. For example, I can use the `.isin()` command to extract a value or list of values. So, if I only wanted the list of twenty point per game scorers from the Suns, Bulls, Lakers, Celtics and Knicks, I can extract them with `isin()`. For null values, imagine an older data set that started from 1964 instead of 1984. The three point line didn't exist before 1979, so I can use `dropna()` to remove any instances of seasons before the three point era. This snippet below showcases a basic use of `isin()` and `dropna()`.
Contributor

Suggested change
We were able to work with a cleaned data set that didn't require too much to change. However, one may want to make more specific queries on the existing data set or eliminate any potential null values. For example, I can use the `.isin()` command to extract a value or list of values. So, if I only wanted the list of twenty point per game scorers from the Suns, Bulls, Lakers, Celtics and Knicks, I can extract them with `isin()`. For null values, imagine an older data set that started from 1964 instead of 1984. The three point line didn't exist before 1979, so I can use `dropna()` to remove any instances of seasons before the three point era. This snippet below showcases a basic use of `isin()` and `dropna()`.
We were able to work with a cleaned data set that didn't require too much to change. However, one may want to make more specific queries on the existing data set or eliminate any potentially null or missing values. For example, I can use the `.isin()` command to extract a value or list of values. So, if I only wanted the list of twenty point per game scorers from the Suns, Bulls, Lakers, Celtics and Knicks, I can extract them with `isin()`. For null values, imagine an older data set that started from 1964 instead of 1984. The three point line didn't exist before 1979, so I can use `dropna()` to remove any instances of seasons before the three point era. This snippet below showcases a basic use of `isin()` and `dropna()`.

adding "or missing" to the first use of null, hopefully helps things be less ambiguous

@@ -0,0 +1,94 @@
# Scraping and Sanitizing Data in Python by Beautiful Soup
Contributor

nit: can consider another title, it's a little clunky as is but I like having all of "Scraping, Python, and Beautiful Soup" in there
