Scraping and Sanitizing Data Code Upload #206
Conversation
Updating inline Python code with backticks and changing `find_all()` to `findAll()`
Adding review comments. All in all, things look great: a couple of nits and one addition, then good to merge. Approving to unblock once the changes are in.
Now that we have our data, we want to re-format it and add it to a final set. We take the `stats` variable and pass it into a pandas DataFrame. Because of how the information was stored, `stats` is a 4-element list with each element holding 100 values. We want the opposite of that, so we call `.transpose()` on our new DataFrame to flip the rows and columns, and then set the column names to our list of features. Finally, we call `.append()` to add the rows to the final DataFrame.
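A small sketch of the reshaping step described above. The feature names and numbers here are made up for illustration (three players instead of 100), and it substitutes `pd.concat` for the post's `.append()` call, since `DataFrame.append` was removed in pandas 2.0:

```python
import pandas as pd

# Illustrative stand-in for the scraped "stats" variable: a 4-element
# list, one inner list per feature (real pages hold 100 values each).
features = ["PTS", "AST", "REB", "STL"]
stats = [
    [30.1, 27.0, 25.3],  # points per game
    [5.4, 7.1, 6.2],     # assists
    [6.0, 7.4, 5.1],     # rebounds
    [1.2, 1.5, 0.9],     # steals
]

# Each inner list is a feature, but we want one row per player, so
# transpose to swap rows and columns, then label the columns.
page_df = pd.DataFrame(stats).transpose()
page_df.columns = features

# Add this page's rows onto the running final set. pd.concat is the
# modern equivalent of the DataFrame.append() call in the post.
final_df = pd.DataFrame(columns=features)
final_df = pd.concat([final_df, page_df], ignore_index=True)
print(final_df.shape)  # (3, 4)
```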
# Bonus: Cleaning The Data
Re: the conversation in our meeting, it's worth showing a screenshot pre- vs. post-filtering to demonstrate the filtering in this section. Otherwise, this section looks pretty good.
We were able to work with a cleaned data set that didn't require many changes. However, you may want to make more specific queries on the existing data or eliminate any potential null values. For example, I can use the `.isin()` method to extract a value or list of values. So, if I only wanted the twenty-point-per-game scorers from the Suns, Bulls, Lakers, Celtics, and Knicks, I could extract them with `isin()`. For null values, imagine an older data set that started in 1964 instead of 1984. The three-point line didn't exist before 1979, so I can use `dropna()` to remove any seasons from before the three-point era. The snippet below showcases a basic use of `isin()` and `dropna()`.
![Alt Text](https://i.imgur.com/Mn052zQ.png)
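A hedged sketch of those two filters on a tiny made-up data set standing in for the scraped stats (the player letters, team abbreviations, and numbers are all illustrative, not real data):

```python
import pandas as pd

# Hypothetical mini data set in place of the scraped stats.
df = pd.DataFrame({
    "Player": ["A", "B", "C", "D"],
    "Tm": ["PHO", "CHI", "LAL", "DET"],
    "PTS": [24.0, 31.2, 21.5, 18.7],
    "3P": [2.1, None, 1.8, None],  # None mimics pre-three-point seasons
})

# isin() keeps rows whose team appears in the list; combined with a
# points filter it gives the 20+ PPG scorers from those five teams.
teams = ["PHO", "CHI", "LAL", "BOS", "NYK"]
scorers = df[df["Tm"].isin(teams) & (df["PTS"] >= 20)]

# dropna() removes rows with missing values, e.g. seasons that predate
# the three-point line.
clean = df.dropna()
print(list(scorers["Player"]))  # ['A', 'B', 'C']
print(list(clean["Player"]))    # ['A', 'C']
```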
include the updated code that prints or otherwise shows the output in this screenshot
```python
# Append an empty row, then drop any rows containing missing values.
# Note: DataFrame.append was removed in pandas 2.0; the equivalent today is
# pd.concat([new_df, pd.DataFrame([pd.Series(dtype='object')])], ignore_index=True)
new_df = new_df.append(pd.Series(dtype='object'), ignore_index=True)
new_df = new_df.dropna()
final_df = final_df.dropna()
```
add the section for displaying filtered data towards the end here
Understanding where our data lies is crucial for implementing the scraper. BeautifulSoup uses an HTML parser to locate the data we want, but we have to give it some baseline information to do so.
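A minimal, runnable illustration of that "baseline information": the parser name plus the tag and attributes identifying the element that holds the data. The inline HTML and the table id `stats` here are made-up stand-ins for the real page, not the site's actual markup:

```python
from bs4 import BeautifulSoup

# Tiny stand-in for the real page so the idea is runnable offline.
html = """
<table id="stats">
  <tr><th>Player</th><th>PTS</th></tr>
  <tr><td>A</td><td>24.0</td></tr>
</table>
"""

# Baseline information: which parser to use, and which tag/attribute
# combination identifies the element we want.
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", id="stats")
cells = [td.get_text() for td in table.find_all("td")]
print(cells)  # ['A', '24.0']
```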
Before we start, we want to ensure the following libraries are installed with `pip install <package_name>`: pandas, numpy, requests, and bs4. The first thing our scraper needs to determine is how many pages of information there are. This particular data set is laid out across 9 different URLs, so we have to examine what changes between two pages. The link is really long, but at the very end there is a parameter that reads `offset=0` on the first page, `offset=100` on the second, and increments by 100 with each page.
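The offset pattern described above can be sketched as a short loop. The base URL below is a placeholder, not the real (very long) link:

```python
# Build the 9 page URLs from the offset pattern: 0, 100, ..., 800.
# "base_url" is a hypothetical placeholder for the real link.
base_url = "https://example.com/stats?offset={}"
urls = [base_url.format(offset) for offset in range(0, 900, 100)]
print(len(urls))  # 9
print(urls[1])    # ...offset=100
```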
consider an "installing dependencies" piece, either here or in the code with an inline comment about using pip3 (or another way?)
import numpy as np
ref comment above: "consider an "installing dependencies" piece, either here or in the code with an inline comment about using pip3 (or another way?)"
blog.md (Outdated)
Suggested change: "…eliminate any potential null values" → "…eliminate any potentially null or missing values"

Adding "or missing" to the first use of null hopefully helps things be less ambiguous.
# Scraping and Sanitizing Data in Python by Beautiful Soup
nit: consider another title; it's a little clunky as is, but I like having all of "Scraping, Python, and Beautiful Soup" in there