In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (10, 5)

<center><h2>CWH Data Science Workshop!🥳</h2></center>

## What is data science? 🤔

### What is data science?

<br>

<center><img src='imgs/what-is-data-science.png' width=60%></center>

<center>Everyone seems to have their own definition of what data science is.</center>

Let's look at a few definitions of data science.

### What is data science?

<center><img src="imgs/image_0.png"></center>

In 2010, Drew Conway published his famous [Data Science Venn Diagram](http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram).

### What is data science?

There isn't agreement on which "Venn Diagram" is correct!

<center><img src="imgs/image_1.png" width=500></center>

- **Why not?** The field is new and rapidly developing.
- Make sure you're solid on the fundamentals, then find a niche that you enjoy.
- Read Kolassa, [Battle of the Data Science Venn Diagrams](http://www.prooffreader.com/2016/09/battle-of-data-science-venn-diagrams.html).

### What does a _data scientist_ do?

The chart below is taken from the [2016 Data Science Salary Survey](https://www.oreilly.com/radar/2016-data-science-salary-survey-results/), administered by O'Reilly. They asked respondents what they spend their time doing on a daily basis. What do you notice? <br>

<center><img src='imgs/survey.png' width=40%></center>

The chart below is taken from the followup [2021 Data/AI Salary Survey](https://www.oreilly.com/radar/2021-data-ai-salary-survey/), also administered by O'Reilly. They asked respondents:

> What technologies will have the biggest effect on compensation in the coming year?

<center><img src='imgs/2021-most-relevant-skill.png' width=45%></center>

As you take more courses, you'll learn to ask questions whose answers are **ambiguous** – this uncertainty is what makes data science challenging!

Let's look at some examples of data science in practice.

### Analyzing Wordle trends

<center><img src='imgs/wordle-moving-average.png' width=70%></center>
    
Moving average of the average number of guesses taken for each Wordle word, based on patterns shared on Twitter. ([source](https://observablehq.com/@rlesser/wordle-twitter-exploration))
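
To make the idea of a moving average concrete, here is a minimal sketch using pandas. The numbers are made up for illustration (they are not the Wordle data behind the chart), and the 3-day window is an arbitrary choice.

```python
import pandas as pd

# Made-up daily averages, for illustration only (not real Wordle data).
avg_guesses = pd.Series(
    [4.1, 3.9, 4.4, 4.0, 4.6, 3.8, 4.2],
    index=pd.date_range("2022-01-01", periods=7, freq="D"),
    name="avg_guesses",
)

# 3-day moving average: each value is the mean of that day and the two
# days before it. The first two days have no full window, so they are NaN.
moving_avg = avg_guesses.rolling(window=3).mean()
print(moving_avg)
```

A larger window smooths out more of the day-to-day noise, at the cost of reacting more slowly to real changes.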

### Is Wordle Dying? The Data Weighs In

<center><img src="imgs/wordle-drop.png" width=40%></center>

> Compared to the peak of 350,000 shares in mid-February, a typical day in September 2022 only saw about 32,000 Wordle shares on Twitter. That’s a 91 percent drop in a span of seven months. 
([source](https://wordfinder.yourdictionary.com/blog/is-wordle-dying-the-data-weighs-in/))
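
The 91 percent figure in the quote can be checked with a quick calculation. This sketch uses only the two numbers given in the quote; the variable names are ours.

```python
peak_shares = 350_000     # mid-February peak, from the quote
typical_shares = 32_000   # typical day in September 2022, from the quote

# Relative drop = (old - new) / old
drop = (peak_shares - typical_shares) / peak_shares
print(f"{drop:.0%}")  # prints 91%
```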

### Data science involves _people_ 🧍

The decisions that we make as data scientists have the potential to impact the livelihoods of other people.

- COVID case forecasting.
- Admissions and hiring.
- Hyper-personalized ad recommendations.
- Criminal sentencing.

### Warning!

- Good data analysis is not:
    - A simple application of a statistics formula.
    - A simple application of statistical software.

- There are many tools out there for data science, but they are merely tools. **They don’t do any of the important thinking – that's where you come in!**

> _“The purpose of computing is insight, not numbers.”_ - R. Hamming. Numerical Methods for Scientists and Engineers (1962).

## The data science lifecycle 🚴

### The scientific method

You learned about the scientific method in elementary school. 

<center><img src="imgs/image_3.png" width=500></center>

However, it hides a lot of complexity.
- Where did the hypothesis come from?
- What data are you modeling? Is the data sufficient?
- Under which conditions are the conclusions valid?

### The data science lifecycle

<center><img src="imgs/DSLC.png" width="40%"></center>

**All steps lead to more questions!** We'll refer back to the data science lifecycle repeatedly throughout the quarter.

## Let's start looking at actual data

In [2]:
# Load the honey production data into a DataFrame.
df = pd.read_csv("data/honeyproduction.csv")

In [3]:
df

Unnamed: 0,state,numcol,yieldpercol,totalprod,stocks,priceperlb,prodvalue,year
0,AL,16000.0,71,1136000.0,159000.0,0.72,818000.0,1998
1,AZ,55000.0,60,3300000.0,1485000.0,0.64,2112000.0,1998
2,AR,53000.0,65,3445000.0,1688000.0,0.59,2033000.0,1998
3,CA,450000.0,83,37350000.0,12326000.0,0.62,23157000.0,1998
4,CO,27000.0,72,1944000.0,1594000.0,0.70,1361000.0,1998
...,...,...,...,...,...,...,...,...
621,VA,4000.0,41,164000.0,23000.0,3.77,618000.0,2012
622,WA,62000.0,41,2542000.0,1017000.0,2.38,6050000.0,2012
623,WV,6000.0,48,288000.0,95000.0,2.91,838000.0,2012
624,WI,60000.0,69,4140000.0,1863000.0,2.05,8487000.0,2012
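
A first step with any new dataset is to check that the columns mean what we think they mean. The sketch below copies the first five rows displayed above into an in-memory CSV, so it runs without `data/honeyproduction.csv`. It rests on two assumptions that the output suggests but does not confirm: that the `Unnamed: 0` column comes from an unlabeled row-index column in the file, and that `totalprod` equals `numcol` times `yieldpercol`.

```python
import io
import pandas as pd

# The first five rows shown above, with an unlabeled first column
# (assumed to be how the row index is stored in the file).
csv_text = """,state,numcol,yieldpercol,totalprod,stocks,priceperlb,prodvalue,year
0,AL,16000.0,71,1136000.0,159000.0,0.72,818000.0,1998
1,AZ,55000.0,60,3300000.0,1485000.0,0.64,2112000.0,1998
2,AR,53000.0,65,3445000.0,1688000.0,0.59,2033000.0,1998
3,CA,450000.0,83,37350000.0,12326000.0,0.62,23157000.0,1998
4,CO,27000.0,72,1944000.0,1594000.0,0.70,1361000.0,1998
"""

# index_col=0 uses the unlabeled first column as the row index,
# so no "Unnamed: 0" column appears in the result.
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Assumed relationship: total production = colonies * yield per colony.
sample["check_totalprod"] = sample["numcol"] * sample["yieldpercol"]
matches = (sample["check_totalprod"] == sample["totalprod"]).all()
print(matches)
```

If the check holds on these rows, the same comparison can be run on the full `df` to see whether it holds everywhere.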
