# Introduction to Data

While many scientific investigations begin with an abstract question, the question is usually revised to fit the constraints of available information and resources. In this class, we will often be limited to available data in semi-structured or well-structured formats. Accordingly, we want to be good at asking two different kinds of questions: **descriptive** and **inferential**. A descriptive question summarizes the data we have; an inferential question uses that data to draw conclusions beyond it.
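To make the distinction concrete, here is a minimal sketch using a small made-up sample of house prices (the numbers are hypothetical and not drawn from any dataset in this class). The first computation answers a descriptive question about the sample itself. The second uses a simple percentile bootstrap, one of several possible approaches, to ask an inferential question about the wider population.

```python
import random
import statistics

# Hypothetical sample of sale prices (in thousands of dollars); illustrative only.
sample_prices = [125, 140, 155, 160, 172, 180, 195, 210, 240, 310]

# Descriptive: what is the median price *in this sample*?
sample_median = statistics.median(sample_prices)

# Inferential: what range of medians is plausible for the wider population?
# A percentile bootstrap resamples the data with replacement many times.
rng = random.Random(0)  # fixed seed so the sketch is reproducible
boot_medians = sorted(
    statistics.median(rng.choices(sample_prices, k=len(sample_prices)))
    for _ in range(2000)
)
lower = boot_medians[int(0.025 * len(boot_medians))]
upper = boot_medians[int(0.975 * len(boot_medians)) - 1]

print(f"Sample median: {sample_median}")
print(f"Approximate 95% bootstrap interval for the median: ({lower}, {upper})")
```

The descriptive answer is a single number about the data in hand, while the inferential answer comes with uncertainty because it reaches beyond the sample.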



### Example I: Ames Housing

This dataset contains information about houses in Ames, Iowa. Here is a link to the description of the dataset:

- [Ames Housing Description](https://ww2.amstat.org/publications/jse/v19n3/decock/DataDocumentation.txt)

Your goal is to consider the different kinds of information available, and to generate at least two *descriptive* and two *inferential* questions about the dataset.  Is there any data that you anticipate being problematic?  
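One candidate for problematic data: the linked documentation uses `NA` as a meaningful category for some variables (for `Alley`, `NA` means no alley access), while pandas reads the string `NA` as a missing value by default. The sketch below illustrates the pitfall on a tiny made-up CSV; the rows and the two-column layout are assumptions for illustration, not the real file.

```python
import io
import pandas as pd

# Made-up rows in the spirit of the Ames data; not the real file.
csv_text = "Alley,SalePrice\nGrvl,120000\nNA,185000\nPave,150000\nNA,210000\n"

# Default parsing: the string "NA" is read as a missing value.
default_df = pd.read_csv(io.StringIO(csv_text))

# Keeping "NA" as text preserves it as a category ("no alley access").
literal_df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(default_df["Alley"].isnull().sum())
print(literal_df["Alley"].value_counts())
```

Which reading is appropriate depends on the variable, so it is worth checking the documentation before treating every `NA` as missing.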

### Example II: Donors Choose

A popular website for learning data science is [kaggle.com](https://www.kaggle.com/). A recent competition involved a dataset from *Donors Choose*, an organization that supports grants to teachers. They framed the problem as follows:

---

<div class="alert alert-info" role="alert">

Next year, DonorsChoose.org expects to receive close to 500,000 project proposals. As a result, there are three main problems they need to solve:

1. How to scale current manual processes and resources to screen 500,000 projects so that they can be posted as quickly and as efficiently as possible

2. How to increase the consistency of project vetting across different volunteers to improve the experience for teachers

3. How to focus volunteer time on the applications that need the most assistance

The goal of the competition is to predict whether or not a DonorsChoose.org project proposal submitted by a teacher will be approved, using the text of project descriptions as well as additional metadata about the project, teacher, and school. DonorsChoose.org can then use this information to identify projects most likely to need further review before approval.
</div>

---

See the full data description [here](https://www.kaggle.com/c/donorschoose-application-screening/data).  Again, your goal is to identify at least two *inferential* and two *descriptive* questions around this dataset.

### Example III: NYC Open Data

---

![NYC Open Data logo](https://opendata.cityofnewyork.us/wp-content/themes/opendata-wp/assets/img/nyc-open-data-logo.svg)

Many city and state agencies have begun providing access to data through APIs. One example is NYC Open Data, located [here](https://opendata.cityofnewyork.us/). For example, if we want to use the dataset with restaurant meal information, we can find the code necessary to import the data. We can quickly have this up and running in a Jupyter notebook and be ready to explore.

```python
import pandas as pd
from sodapy import Socrata

# Unauthenticated client only works with public data sets. Note 'None'
# in place of application token, and no username or password:
client = Socrata("data.cityofnewyork.us", None)

# Example authenticated client (needed for non-public datasets):
# client = Socrata("data.cityofnewyork.us",
#                  MyAppToken,
#                  username="user@example.com",
#                  password="AFakePassword")

# First 31000 results, returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get("r3pg-q9c3", limit=31000)

# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)
```

In [18]:
results_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31000 entries, 0 to 30999
Data columns (total 48 columns):
calories                  25196 non-null object
calories_100g             12572 non-null object
calories_text             119 non-null object
carbohydrates             24572 non-null object
carbohydrates_100g        12463 non-null object
carbohydrates_text        316 non-null object
cholesterol               23775 non-null object
cholesterol_100g          12036 non-null object
cholesterol_text          166 non-null object
dietary_fiber             24081 non-null object
dietary_fiber_100g        12329 non-null object
dietary_fiber_text        424 non-null object
food_category             31000 non-null object
item_description          31000 non-null object
item_name                 31000 non-null object
kids_meal                 31000 non-null object
limited_time_offer        31000 non-null object
menu_item_id              31000 non-null object
potassium                 389 non-n

In [44]:
results_df.groupby('restaurant')['item_description'].count().nlargest(10)

restaurant
Starbucks             3849
Dunkin' Donuts        2850
Sheetz                1054
Sonic                  982
Pizza Hut              972
Golden Corral          654
Jersey Mike's Subs     616
Dominos                557
Perkins                553
Jason's Deli           522
Name: item_description, dtype: int64

In [22]:
results_df.calories.describe()

count     25196
unique      992
top           0
freq        924
Name: calories, dtype: object
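
The `dtype: object` above indicates that the calorie values arrived as strings rather than numbers, so numeric summaries such as the mean are not yet available. One way to handle this, sketched here on a small made-up frame rather than the real `results_df`, is `pd.to_numeric` with `errors="coerce"`, which turns unparseable entries into `NaN`. The `toy_df` name and its values are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for results_df: numbers stored as strings, plus gaps.
toy_df = pd.DataFrame({
    "restaurant": ["A", "A", "B", "B"],
    "calories": ["320", "490", None, "n/a"],
})

# Convert strings to numbers; anything unparseable becomes NaN.
toy_df["calories"] = pd.to_numeric(toy_df["calories"], errors="coerce")

print(toy_df["calories"].dtype)
print(toy_df["calories"].describe())
```

After conversion, `describe()` reports numeric statistics instead of `unique`, `top`, and `freq`. Coercion silently discards text entries, so it is worth counting how many values became `NaN`.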

In [24]:
results_df.calories.isnull().sum()

5804

In [45]:
import numpy as np
results_df[results_df.calories.isnull()][['restaurant', 'calories']].describe()

| | restaurant | calories |
|---|---|---|
| count | 5804 | 0.0 |
| unique | 81 | 0.0 |
| top | Dunkin' Donuts | |
| freq | 1900 | |
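
The summary above suggests that missing calorie values are concentrated in particular restaurants. Here is a hedged sketch of how one might compare missing values across groups, again on a made-up frame (`toy_df` and its values are hypothetical, not taken from the real data):

```python
import pandas as pd

# Toy stand-in for results_df with some missing calorie values.
toy_df = pd.DataFrame({
    "restaurant": ["A", "A", "A", "B", "B", "C"],
    "calories": ["320", None, None, "270", "350", None],
})

# Flag missing values, then summarize the flag within each restaurant:
# the sum counts missing rows and the mean gives the share missing.
missing_by_restaurant = (
    toy_df["calories"].isnull()
    .groupby(toy_df["restaurant"])
    .agg(["sum", "mean"])
    .rename(columns={"sum": "n_missing", "mean": "share_missing"})
    .sort_values("n_missing", ascending=False)
)

print(missing_by_restaurant)
```

A raw count favors restaurants with many menu items, so the share of missing values is often the more informative column.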


In [48]:
results_df[results_df.restaurant == "Dunkin' Donuts"][['item_description', 'calories']].head()

| | item_description | calories |
|---|---|---|
| 6304 | Whole Wheat Bagel, Bagels, Bakery, Food | 320 |
| 6340 | Bismark, Donuts, Bakery | 490 |
| 6461 | Jelly Donut, Donuts, Bakery, Food | 270 |
| 6480 | Chocolate Frosted Cake Donut, Donuts, Bakery, ... | 350 |
| 6530 | Cinnamon Stick, Donuts, Bakery | 380 |
