A fun side project exploring hot peppers!
I grew up with very little spice in my diet -- my mother is Irish, what can I say? -- and get regularly mocked for my lack of spice tolerance. I figured that I'd do some research into these nuggets of suffering and glory so that the next time I am prompted to eat a pepper, I can distract my mocking friends enough to escape the actual act of pepper consumption.
- Web scraping (Requests, urllib, nonces)
- HTML parsing (BeautifulSoup, Selenium)
- Data sanitization (Pandas, fuzzywuzzy, difflib, Regex)
- Regression analysis (scikit-learn)
- Code design (Python, modules, OOP, scalability)
The data is currently curated from PepperScale, ChiliWorld, Uncle Steve's Hot Stuff, Cayenne Diane, and Pepperheads for Life. I have no affiliation with any of the sites, but am grateful for their work!
While this project is in the "data sanitization" phase, you can find the most up-to-date set in data/. Both .json and .csv formats are available! If you plan on using the data, I'd love to know about it :)
Field | Description |
---|---|
"name" | String; name of the pepper; unique |
"species" | String; pepper species. All hot peppers belong to the Capsicum genus (part of the nightshade family), but there are multiple species within it. |
"heat" | Categorical; how hot the pepper is perceived to be: "Mild", "Medium", "Extra Hot", "Super Hot" (their categories, not mine) |
"region" | Categorical; region of the world in which the pepper grows; based on provided origin. (Standardized origin) |
"origin" | String; where the pepper grows; values are country, region, or continent as listed |
"min_shu" | Float; Scoville Heat Units (SHU) for the mildest variation of the pepper |
"max_shu" | Float; Scoville Heat Units (SHU) for the hottest variation of the pepper |
"min_jrp" | Float; Jalapeño Reference Point (JRP) for the minimum number of times hotter than a jalapeño the pepper is |
"max_jrp" | Float; Jalapeño Reference Point (JRP) for the maximum number of times hotter than a jalapeño the pepper is |
"detail_link" | String; link to more information on the pepper |
"source_link" | String; data source link |
"source_name" | String; name of source site from which pepper data came |
Pepper hotness is based on the Scoville Scale, a measurement of the pungency of chili peppers running from mild to extreme. If you're interested in the scale -- along with its many pros and cons -- I recommend you read PepperScale's article on the subject here, or trusty Wikipedia.
Basis for min/max Scoville heat units (SHU): Individual hot peppers have a range of heat, depending on where they are grown, how long they’ve matured, and the amount of sun they’ve received.
Basis for min/max Jalapeño Reference Point (JRP): The JRP is a subjective comparison of a pepper against a reference point most everyone has tried, resulting in a range of opinions. A negative number (like -50) means the number of times the pepper is milder than a jalapeño. A zero (0) means equal heat. A positive number shows the number of times the pepper is hotter than a jalapeño.
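The sign convention for JRP values can be sketched as a tiny helper. `describe_jrp` is a hypothetical name used only for illustration; it simply turns a JRP number into the reading described above.

```python
def describe_jrp(jrp):
    """Turn a JRP value into a plain-English comparison against a jalapeño."""
    if jrp < 0:
        # Negative: the pepper is that many times milder
        return f"{abs(jrp):g} times milder than a jalapeño"
    if jrp == 0:
        # Zero: equal heat
        return "equal heat to a jalapeño"
    # Positive: the pepper is that many times hotter
    return f"{jrp:g} times hotter than a jalapeño"


for value in (-50, 0, 2.5):
    print(value, "->", describe_jrp(value))
```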
- How to Scrape an AJAX Website using Python
- Explanation of the "json": {"key":"value"} addition to the POST request (missing in the Requests documentation...?)
- When my scraper broke the day after I built it, I learned about nonces in WordPress. I had to find a way to fetch the daily nonce to complete the AJAX request.
- scikit-learn's description of linear models
- How to handle outliers
- FiveThirtyEight's article on Rating Chili Peppers
- Guinness Book of World Records on the Hottest Chili
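The nonce step mentioned in the resources above could look roughly like the sketch below. It is not the scraper's actual code: it assumes the page embeds the nonce in an inline script as `"nonce":"<hex string>"`, and the pattern, sample markup, and `hypothetical_action` name are all invented for illustration. The idea is to pull the current nonce out of the page HTML first, then include it in the AJAX request payload.

```python
import re

# Hypothetical pattern: assumes the nonce appears as "nonce":"<hex>" in the page source
NONCE_PATTERN = re.compile(r'"nonce"\s*:\s*"([0-9a-f]+)"')


def extract_nonce(html):
    """Return the first nonce found in the page HTML, or None if there is none."""
    match = NONCE_PATTERN.search(html)
    return match.group(1) if match else None


# Invented sample markup standing in for a fetched page
sample_page = (
    '<script>var settings = {"ajax_url":"/wp-admin/admin-ajax.php",'
    '"nonce":"a1b2c3d4e5"};</script>'
)

nonce = extract_nonce(sample_page)
# The payload keys are assumptions; a real site defines its own action and field names
payload = {"action": "hypothetical_action", "nonce": nonce}
print(payload)
```

Since the nonce changes over time, the scraper would fetch the page and re-extract it on every run instead of hard-coding a value.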
Check out my Trello board for insight into my process, what's been done, and what's on the docket.
I welcome any and all contributions from the world at large! If you're interested in collaborating, please consider the following:
- Git flow: fork the repository, then submit a PR
- Request to be added as a member to the Trello board