# Data wrangling tutorial for Python with pandas

This pandas tutorial mimics the familiar [dplyr example with the NYC Flights dataset](https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html) and the [reshape2 example with the Air Quality dataset](http://seananderson.ca/2013/10/19/reshape.html).

## First up: basics, slicing, and aggregating

**R code:**

```R
library(dplyr)
```

**Python equivalent:**

```python
%matplotlib notebook
import pandas as pd
```

Note that the first line above is a Jupyter magic that renders plots inside the notebook; it is not needed in a plain Python script.

**R code:**

```R
df <- read.csv("./data/flights.csv")
```

**Python equivalent:**
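
`pd.read_csv` is the usual counterpart of `read.csv`. The sketch below reads a tiny in-memory CSV so it runs without the data file; the two sample rows are a hypothetical stand-in, and with the real data you would pass `"./data/flights.csv"` instead.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for ./data/flights.csv.
csv_text = """year,month,day,dep_delay,arr_delay,tailnum
2013,1,1,2,11,N14228
2013,1,1,4,20,N24211
"""

# With the real file: df = pd.read_csv("./data/flights.csv")
df = pd.read_csv(io.StringIO(csv_text))
```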

**R code:**

```R
dim(df)
```

**Python equivalent:**
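
The `shape` attribute returns a `(rows, columns)` tuple, much like `dim`. A minimal sketch on a hypothetical stand-in frame:

```python
import pandas as pd

# Hypothetical stand-in for the flights data.
df = pd.DataFrame({"year": [2013, 2013, 2013], "month": [1, 1, 2], "day": [1, 2, 3]})

# shape is an attribute, not a method: no parentheses.
n_rows, n_cols = df.shape
```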

**R code:**

```R
head(df)
```

**Python equivalent:**
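
`DataFrame.head` does the same job, with one difference worth knowing: it returns 5 rows by default, whereas R's `head` returns 6. A sketch on a hypothetical stand-in frame:

```python
import pandas as pd

# Hypothetical stand-in for the flights data (8 rows).
df = pd.DataFrame({"month": list(range(1, 9)), "day": list(range(11, 19))})

first_five = df.head()   # pandas default: 5 rows
first_six = df.head(6)   # matches R's default of 6 rows
```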

**R code:**

```R
head(filter(df, month == 1, day == 1))
```

**Python equivalent:**
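
Row filtering is typically done with a boolean mask. Conditions that dplyr separates with commas are combined with `&`, and each condition needs its own parentheses. A sketch on a hypothetical stand-in frame:

```python
import pandas as pd

# Hypothetical stand-in for the flights data.
df = pd.DataFrame({"month": [1, 1, 2, 1], "day": [1, 2, 1, 1], "dep_delay": [2.0, 4.0, -1.0, 7.0]})

# Parentheses are required because & binds tighter than ==.
jan_first = df[(df["month"] == 1) & (df["day"] == 1)]
jan_first.head()
```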

**R code:**

```R
head(filter(df, month == 1 | month == 2))
```

**Python equivalent:**
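
An "or" condition uses `|` in the mask. When the test is membership in a set of values, `isin` is often easier to read. A sketch on a hypothetical stand-in frame showing both spellings:

```python
import pandas as pd

# Hypothetical stand-in for the flights data.
df = pd.DataFrame({"month": [1, 2, 3, 2], "day": [1, 5, 9, 12]})

# Direct translation of month == 1 | month == 2.
jan_feb = df[(df["month"] == 1) | (df["month"] == 2)]

# Equivalent membership test.
jan_feb_isin = df[df["month"].isin([1, 2])]
```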

**R code:**

```R
head(arrange(df, desc(arr_delay)))
```

**Python equivalent:**
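
`sort_values` with `ascending=False` plays the role of `arrange(desc(...))`. Missing values are placed last by default. A sketch on a hypothetical stand-in frame:

```python
import pandas as pd

# Hypothetical stand-in for the flights data; None becomes NaN.
df = pd.DataFrame({"tailnum": ["N1", "N2", "N3", "N4"], "arr_delay": [11.0, None, 20.0, -8.0]})

# Largest arrival delay first; NaN rows sort to the end.
by_delay = df.sort_values("arr_delay", ascending=False)
by_delay.head()
```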

**R code:**

```R
head(select(df, year, month, day))
```

**Python equivalent:**
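
Selecting columns by name is done by indexing with a list of labels. A sketch on a hypothetical stand-in frame:

```python
import pandas as pd

# Hypothetical stand-in for the flights data.
df = pd.DataFrame({"year": [2013, 2013], "month": [1, 2], "day": [1, 5], "dep_delay": [2.0, 4.0]})

# Double brackets: the inner list names the columns to keep.
ymd = df[["year", "month", "day"]]
ymd.head()
```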

**R code:**

```R
head(select(df, year:day))
```

**Python equivalent:**
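
A range of adjacent columns can be taken with a label slice in `.loc`. Unlike positional slices in Python, label slices include both endpoints, which matches `year:day` in dplyr. A sketch on a hypothetical stand-in frame:

```python
import pandas as pd

# Hypothetical stand-in for the flights data.
df = pd.DataFrame({"year": [2013, 2013], "month": [1, 2], "day": [1, 5], "dep_delay": [2.0, 4.0]})

# All rows, columns from "year" through "day" inclusive.
ymd = df.loc[:, "year":"day"]
ymd.head()
```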

**R code:**

```R
head(select(df, -(year:day)))
```

**Python equivalent:**
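
One way to drop a range of columns is to build the range with a `.loc` label slice and pass its column labels to `drop`. A sketch on a hypothetical stand-in frame:

```python
import pandas as pd

# Hypothetical stand-in for the flights data.
df = pd.DataFrame({
    "year": [2013, 2013], "month": [1, 2], "day": [1, 5],
    "dep_delay": [2.0, 4.0], "arr_delay": [11.0, 20.0],
})

# Labels of the columns to exclude, then drop them.
to_drop = df.loc[:, "year":"day"].columns
rest = df.drop(columns=to_drop)
rest.head()
```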

**R code:**

```R
df <- rename(df, tail_num = tailnum)
head(df)
```

**Python equivalent:**
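
`rename` takes a mapping from old to new names. Note the order is reversed relative to dplyr, where the new name comes first. A sketch on a hypothetical stand-in frame:

```python
import pandas as pd

# Hypothetical stand-in for the flights data.
df = pd.DataFrame({"tailnum": ["N14228", "N24211"], "arr_delay": [11.0, 20.0]})

# {old_name: new_name}; rename returns a new frame, so reassign it.
df = df.rename(columns={"tailnum": "tail_num"})
df.head()
```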

**R code:**

```R
distinct(df, tail_num)
```

**Python equivalent:**
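
`drop_duplicates` on a one-column frame gives the distinct values while keeping a DataFrame result, as `distinct` does. The sketch uses the renamed `tail_num` column on a hypothetical stand-in frame:

```python
import pandas as pd

# Hypothetical stand-in for the flights data (after the rename above).
df = pd.DataFrame({"tail_num": ["N1", "N2", "N1"], "arr_delay": [11.0, 20.0, 5.0]})

# Double brackets keep a DataFrame; df["tail_num"].unique() would return an array instead.
distinct_tails = df[["tail_num"]].drop_duplicates()
```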

**R code:**

```R
unique_vals <- distinct(df, origin, dest)
```

**Python equivalent:**
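
The same approach extends to combinations of columns: select them, then drop duplicate rows. A sketch on a hypothetical stand-in frame:

```python
import pandas as pd

# Hypothetical stand-in for the flights data.
df = pd.DataFrame({"origin": ["EWR", "LGA", "EWR"], "dest": ["IAH", "IAH", "IAH"]})

# One row per distinct (origin, dest) pair; the first occurrence is kept.
unique_vals = df[["origin", "dest"]].drop_duplicates()
```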

**R code:**

```R
mutate(df, 
       gain = arr_delay - dep_delay,
       speed = distance / air_time * 60)
```

**Python equivalent:**
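
`assign` is the closest match to `mutate`: it returns a copy with the new columns added and leaves the original frame untouched. A sketch on a hypothetical stand-in frame:

```python
import pandas as pd

# Hypothetical stand-in for the flights data.
df = pd.DataFrame({
    "arr_delay": [11.0, 20.0], "dep_delay": [2.0, 4.0],
    "distance": [600, 1000], "air_time": [120.0, 100.0],
})

# Each keyword argument becomes a new column.
with_gain = df.assign(
    gain=df["arr_delay"] - df["dep_delay"],
    speed=df["distance"] / df["air_time"] * 60,
)
```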

**R code:**

```R
summarise(df, 
          delay = mean(dep_delay, na.rm = TRUE))
```

**Python equivalent:**
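
For a single summary value, calling the aggregation on the column is enough. pandas skips missing values by default (`skipna=True`), so no counterpart to `na.rm = TRUE` is needed. A sketch on a hypothetical stand-in frame:

```python
import pandas as pd

# Hypothetical stand-in for the flights data; None becomes NaN.
df = pd.DataFrame({"dep_delay": [2.0, 4.0, None]})

# NaN is ignored, so this averages 2.0 and 4.0 only.
delay = df["dep_delay"].mean()
```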

**R code:**

```R
sample_n(df, 5)
```

**Python equivalent:**
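
`sample` with `n` draws a fixed number of rows without replacement. The `random_state` value below is an arbitrary choice added only to make the sketch reproducible; the frame is a hypothetical stand-in.

```python
import pandas as pd

# Hypothetical stand-in for the flights data (20 rows).
df = pd.DataFrame({"flight": list(range(20))})

# Five random rows; drop random_state for a different draw each time.
picked = df.sample(n=5, random_state=0)
```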

**R code:**

```R
dim(sample_frac(df, 0.01))
```

**Python equivalent:**
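
The same method takes `frac` for a proportional sample. A sketch on a hypothetical 200-row stand-in frame, where 1% is 2 rows:

```python
import pandas as pd

# Hypothetical stand-in for the flights data (200 rows).
df = pd.DataFrame({"flight": list(range(200))})

# 1% of the rows, then the (rows, columns) tuple.
sample_shape = df.sample(frac=0.01).shape
```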

**R code:**

```R
summary <- df %>% group_by(tail_num) %>% 
  summarise(count = n(),
            dist = mean(distance, na.rm = TRUE),
            delay = mean(arr_delay, na.rm = TRUE))
```

**Python equivalent:**
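
`groupby` followed by named aggregation in `agg` maps closely onto `group_by` plus `summarise`. The sketch below assumes a pandas version that supports named aggregation and runs on a hypothetical stand-in frame. One difference to keep in mind: `groupby` drops rows whose key is missing by default, while dplyr keeps them as their own group.

```python
import pandas as pd

# Hypothetical stand-in for the flights data; None becomes NaN.
df = pd.DataFrame({
    "tail_num": ["A", "A", "B"],
    "distance": [100, 300, 500],
    "arr_delay": [10.0, None, 5.0],
})

summary = (
    df.groupby("tail_num")
      .agg(
          count=("distance", "size"),   # rows per group, like n()
          dist=("distance", "mean"),    # mean skips NaN by default
          delay=("arr_delay", "mean"),
      )
      .reset_index()                    # turn the group key back into a column
)
```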

**R code:**

```R
delay <- summary %>% filter(count > 20, dist < 2000)
head(delay)
```

**Python equivalent:**
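
Filtering the summary works like any other boolean mask. The column has to be written as `summary["count"]` because `summary.count` refers to the DataFrame's `count` method. A sketch on a hypothetical stand-in summary frame:

```python
import pandas as pd

# Hypothetical stand-in for the grouped summary.
summary = pd.DataFrame({
    "tail_num": ["A", "B", "C"],
    "count": [25, 10, 40],
    "dist": [900.0, 500.0, 2500.0],
    "delay": [4.0, 12.0, -1.0],
})

# Bracket access avoids the clash with the DataFrame.count method.
delay = summary[(summary["count"] > 20) & (summary["dist"] < 2000)]
delay.head()
```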

**R code:**

```R
library(ggplot2)
ggplot(delay, aes(dist, delay)) +
  geom_point(aes(size = count), alpha = 1/2)
```

**Python equivalent:**
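
A rough counterpart using matplotlib directly: `scatter` accepts per-point marker sizes through `s` and transparency through `alpha`. This is one possible sketch on a hypothetical stand-in frame, not a pixel-for-pixel match of the ggplot2 output (marker sizes are in points squared, and no size legend is drawn).

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the filtered summary.
delay = pd.DataFrame({
    "dist": [400.0, 900.0, 1500.0],
    "delay": [12.0, 4.0, -1.0],
    "count": [25, 60, 120],
})

fig, ax = plt.subplots()
# Marker area scales with the number of flights per plane.
ax.scatter(delay["dist"], delay["delay"], s=delay["count"], alpha=0.5)
ax.set_xlabel("dist")
ax.set_ylabel("delay")
```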

**R code:**

```R
destinations <- df %>% group_by(dest) %>%
  summarise(planes = n_distinct(tail_num),
            flights = n())
```

**Python equivalent:**
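
`nunique` stands in for `n_distinct` and `size` for `n()`. A sketch with named aggregation on a hypothetical stand-in frame:

```python
import pandas as pd

# Hypothetical stand-in for the flights data.
df = pd.DataFrame({
    "dest": ["IAH", "IAH", "MIA"],
    "tail_num": ["N1", "N1", "N2"],
})

destinations = (
    df.groupby("dest")
      .agg(
          planes=("tail_num", "nunique"),  # distinct planes per destination
          flights=("tail_num", "size"),    # rows per destination
      )
      .reset_index()
)
```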

## Next up: melting and casting

**R code:**

```R
airquality <- read.csv("./data/airquality.csv")
head(airquality)
```

**Python equivalent:**
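
Reading the air quality file works the same way as before. The sketch reads a small in-memory CSV so it runs without the data file; the rows and the lowercase column names are assumptions about what `./data/airquality.csv` contains.

```python
import io
import pandas as pd

# Hypothetical sample standing in for ./data/airquality.csv.
csv_text = """ozone,solar.r,wind,temp,month,day
41,190,7.4,67,5,1
36,118,8.0,72,5,2
"""

# With the real file: airquality = pd.read_csv("./data/airquality.csv")
airquality = pd.read_csv(io.StringIO(csv_text))
airquality.head()
```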

**R code:**

```R
library(reshape2)
aql <- melt(airquality, id.vars = c("month", "day"))
head(aql)
```

**Python equivalent:**
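
`pd.melt` takes the identifier columns as `id_vars` and stacks every other column into `variable` and `value` columns, the same default names reshape2 uses. A sketch on a hypothetical stand-in frame:

```python
import pandas as pd

# Hypothetical stand-in for the air quality data.
airquality = pd.DataFrame({
    "ozone": [41.0, 36.0], "wind": [7.4, 8.0],
    "month": [5, 5], "day": [1, 2],
})

# Wide to long: one row per (month, day, variable).
aql = pd.melt(airquality, id_vars=["month", "day"])
aql.head()
```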

**R code:**

```R
aql <- melt(airquality, id.vars = c("month", "day"),
  variable.name = "climate_variable", 
  value.name = "climate_value")
head(aql)
```

**Python equivalent:**
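
The output column names are set with `var_name` and `value_name`, matching `variable.name` and `value.name`. A sketch on a hypothetical stand-in frame:

```python
import pandas as pd

# Hypothetical stand-in for the air quality data.
airquality = pd.DataFrame({
    "ozone": [41.0, 36.0], "wind": [7.4, 8.0],
    "month": [5, 5], "day": [1, 2],
})

aql = pd.melt(
    airquality,
    id_vars=["month", "day"],
    var_name="climate_variable",
    value_name="climate_value",
)
aql.head()
```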

**R code:**

```R
aql <- melt(airquality, id.vars = c("month", "day"))
aqw <- dcast(aql, month + day ~ variable)
head(aqw)
```

**Python equivalent:**
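
One way to cast back to wide format is `pivot_table`, with the identifier columns as the index and the `variable` column spread across the columns. It aggregates with the mean by default, which changes nothing when each (month, day, variable) combination appears once, as assumed here. It also drops columns that are entirely missing, so the result can differ from `dcast` on such data. A sketch on a hypothetical stand-in frame:

```python
import pandas as pd

# Hypothetical stand-in for the air quality data.
airquality = pd.DataFrame({
    "ozone": [41.0, 36.0], "temp": [67.0, 72.0],
    "month": [5, 5], "day": [1, 2],
})

aql = pd.melt(airquality, id_vars=["month", "day"])

# Long to wide: one column per distinct value of "variable".
aqw = aql.pivot_table(index=["month", "day"], columns="variable", values="value").reset_index()
aqw.columns.name = None  # remove the leftover "variable" label on the column axis
aqw.head()
```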