# 1-D Exploratory Data Analysis

In this notebook, do some exploratory data analysis (EDA) in one dimension. Pick a column (or a set of columns) you're interested in. Calculate some summary statistics (such as mean, median, min, max, and standard deviation). Then make some plots to visualize the distribution of the data. Distribution plots include histograms, boxplots, dotplots, beeswarms, and violin plots. Review [ggplot-intro](https://github.com/data4news/ggplot-intro) for examples of these kinds of distribution plots.

### Standard Python and R imports

In [1]:
%load_ext rpy2.ipython
%load_ext autoreload
%autoreload 2

%matplotlib inline  
from matplotlib import rcParams
rcParams['figure.figsize'] = (16, 100)

import warnings
from rpy2.rinterface import RRuntimeWarning
warnings.filterwarnings("ignore") # Ignore all warnings
# warnings.filterwarnings("ignore", category=RRuntimeWarning) # Show some warnings

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display, HTML



In [2]:
%%javascript
// Disable auto-scrolling
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}


In [3]:
%%R

# My commonly used R imports

require('tidyverse')

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


Loading required package: tidyverse


## Load the data

In [4]:
%%R
 
# Import data with R
df <- read_csv('active_tobacco_retailer_map_correct_cleaned_with_census.csv', show_col_types = FALSE)
df %>% head(4)

# A tibble: 4 × 15
  LOCAL.HEALTH.UNIT OPERATION.NAME      CITY     STATE   ZIP MUNICIPALITY COUNTY
  <chr>             <chr>               <chr>    <chr> <dbl> <chr>        <chr> 
1 NYC               2918 GAS CORP       BRONX    NY    10469 BRONX        NEW Y…
2 NYC               SHISHA KING CORP    BROOKLYN NY    11207 BROOKLYN     NEW Y…
3 NYC               TUNI'S SERVICE CORP BROOKLYN NY    11215 BROOKLYN     NEW Y…
4 NYC               VISHWA NEWS, INC.   BROOKLYN NY    11201 BROOKLYN     NEW Y…
# ℹ 8 more variables: VENDOR.TYPE <chr>, CREATION.DATE <chr>, LOCATION <chr>,
#   LAT <dbl>, LONG <dbl>, ADDRESS <chr>, census_code <dbl>, census_tract <chr>


## Summary statistics

Pick a column or set of columns and calculate some summary statistics (such as mean, median, min, max, and standard deviation).
Hint: you may want to use `group_by` and `summarize`.
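
To make the group-then-summarize idea concrete, here is a minimal plain-Python sketch that uses only the standard library. The rows and the helper name `summarize_by_group` are hypothetical and made up for illustration; they are not taken from the dataset. In R, the equivalent would be a `group_by` followed by `summarize`.

```python
from collections import defaultdict
from statistics import mean, median, stdev

# Hypothetical toy rows of (group label, numeric value); NOT real data.
rows = [
    ("BROOKLYN", 40.65),
    ("BROOKLYN", 40.69),
    ("BROOKLYN", 40.67),
    ("QUEENS", 40.73),
    ("QUEENS", 40.75),
]


def summarize_by_group(rows):
    """Group (label, value) pairs by label and compute summary statistics."""
    groups = defaultdict(list)
    for label, value in rows:
        groups[label].append(value)

    summary = {}
    for label, values in groups.items():
        summary[label] = {
            "n": len(values),
            "mean": mean(values),
            "median": median(values),
            "min": min(values),
            "max": max(values),
            # The sample standard deviation needs at least two values.
            "sd": stdev(values) if len(values) > 1 else float("nan"),
        }
    return summary


summary = summarize_by_group(rows)
for label, stats in summary.items():
    print(label, stats)
```

Each group yields one row of statistics, which is the same shape of result that `summarize` returns after `group_by`.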



In [5]:
%%R 

# code for summary statistics
# count the number of entries in each MUNICIPALITY
df %>% 
  group_by(MUNICIPALITY) %>% 
  summarise(n = n()) %>% 
  arrange(desc(n)) 


# A tibble: 6 × 2
  MUNICIPALITY      n
  <chr>         <int>
1 BROOKLYN       2027
2 QUEENS         1552
3 BRONX          1188
4 STATEN ISLAND   344
5 NEW YORK CITY   323
6 QUEENSBURY       35


In [6]:
%%R

# Remove rows where MUNICIPALITY is QUEENSBURY so only the five NYC values remain
df <- df %>% filter(MUNICIPALITY != 'QUEENSBURY')
df

# A tibble: 5,434 × 15
   LOCAL.HEALTH.UNIT OPERATION.NAME        CITY  STATE   ZIP MUNICIPALITY COUNTY
   <chr>             <chr>                 <chr> <chr> <dbl> <chr>        <chr> 
 1 NYC               2918 GAS CORP         BRONX NY    10469 BRONX        NEW Y…
 2 NYC               SHISHA KING CORP      BROO… NY    11207 BROOKLYN     NEW Y…
 3 NYC               TUNI'S SERVICE CORP   BROO… NY    11215 BROOKLYN     NEW Y…
 4 NYC               VISHWA NEWS, INC.     BROO… NY    11201 BROOKLYN     NEW Y…
 5 NYC               JAMAST HOLDING CORP   FLUS… NY    11358 QUEENS       NEW Y…
 6 NYC               60 ST SMOKE SHOP      BROO… NY    11220 BROOKLYN     NEW Y…
 7 NYC               DELI GRILL VAPE AND … BROO… NY    11226 BROOKLYN     NEW Y…
 8 NYC               37 SMOKE SHOP INC     MANH… NY    10018 NEW YORK CI… NEW Y…
 9 NYC               YESENIA A. RODRIGUEZ… BROO… NY    11237 BROOKLYN     NEW Y…
10 NYC               GNP SUNIL CORP        BRONX NY    10463 BRONX        NEW Y…
# ℹ 5

In [7]:
%%R 

# code for summary statistics
# re-count the number of entries in each MUNICIPALITY after dropping QUEENSBURY
df %>% 
  group_by(MUNICIPALITY) %>% 
  summarise(n = n()) %>% 
  arrange(desc(n)) 


# A tibble: 5 × 2
  MUNICIPALITY      n
  <chr>         <int>
1 BROOKLYN       2027
2 QUEENS         1552
3 BRONX          1188
4 STATEN ISLAND   344
5 NEW YORK CITY   323


In [None]:
%%R

discrete_variables <- c('vs', 'am', 'gear', 'carb')
# 👉 Select the discrete variables only and count the rows for each value,
# so we know how many cars there are in each category (for example, how many automatic vs. manual).

mtcars %>% 
    # all_of() makes it explicit that we are selecting by an external character vector
    select(all_of(discrete_variables)) %>%
    pivot_longer(all_of(discrete_variables), names_to = "variable", values_to = "value") %>% 
    group_by(variable, value) %>% 
    summarize(
        count = n(),
        .groups = "drop" # return an ungrouped tibble and silence the grouping message
    )

## 1-D visualizations (aka distributions)


### Continuous variables

For each continuous variable you are interested in, use ggplot to make a plot of the distribution. You can use histograms, dot plots, box plots, beeswarms, and so on (whichever chart type you find most useful). Learn about that variable and give each chart a headline that explains what you're seeing. The chart can also show the mean or median of the variable for reference (for example, on a histogram you can add a vertical line through the median).
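
As a rough Python illustration of the histogram-plus-median idea, the sketch below first shows what a histogram counts and then draws one with matplotlib. The values and the helper names `bin_counts` and `plot_histogram_with_median` are hypothetical, and `bin_counts` only approximates equal-width binning. The assignment itself asks for ggplot, so treat this as a sketch of the concept rather than the expected solution.

```python
from statistics import median

# Hypothetical toy values with one large outlier; NOT real data.
values = [1, 2, 2, 3, 3, 3, 4, 4, 10]


def bin_counts(values, n_bins):
    """Count how many values fall into n_bins equal-width bins over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        if width == 0:
            idx = 0  # all values identical: put everything in the first bin
        else:
            # The maximum value goes in the last bin instead of overflowing.
            idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


def plot_histogram_with_median(values, n_bins, title):
    """Draw a histogram with a dashed vertical line at the median."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    fig, ax = plt.subplots(figsize=(8, 4))
    ax.hist(values, bins=n_bins)
    ax.axvline(median(values), linestyle="--", color="black", label="median")
    ax.set_title(title)
    ax.legend()
    return fig


print(bin_counts(values, 3), median(values))
```

Calling `plot_histogram_with_median(values, 3, "Most values are small, with one large outlier")` in a notebook cell would render the chart; the headline should state the finding instead of just naming the variable.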

In [None]:
# code for plot 1
# make sure to make a meaningful title and subtitle

In [None]:
# code for plot 2
# make sure to make a meaningful title and subtitle

In [None]:
# code for plot 3
# make sure to make a meaningful title and subtitle

### Discrete variables

If there are any discrete variables you'd like to analyze, you can do that with charts here.
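
For a discrete variable, the distribution is just the count of rows per category. A minimal Python sketch of that idea follows. The labels and the helper names `sorted_counts` and `plot_counts` are hypothetical, chosen only for illustration; in the R cells, a `geom_bar` chart would play the same role.

```python
from collections import Counter

# Hypothetical category labels, made up for illustration; NOT real data.
labels = ["BROOKLYN", "QUEENS", "BROOKLYN", "BRONX", "BROOKLYN", "QUEENS"]


def sorted_counts(labels):
    """Return (category, count) pairs, most frequent first."""
    return Counter(labels).most_common()


def plot_counts(labels, title):
    """Draw a horizontal bar chart of category counts, largest at the top."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    categories, counts = zip(*sorted_counts(labels))
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.barh(categories, counts)
    ax.invert_yaxis()  # put the most frequent category at the top
    ax.set_title(title)
    ax.set_xlabel("count")
    return fig


print(sorted_counts(labels))
```

Sorting the bars by count, rather than alphabetically, usually makes the ranking of categories easier to read.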

In [None]:
# code for plot 1
# make sure to make a meaningful title and subtitle

In [None]:
# code for plot 2
# make sure to make a meaningful title and subtitle

In [None]:
# code for plot 3
# make sure to make a meaningful title and subtitle