# Data Analysis Quickstart


## Introduction

Here we'll take a whistle-stop tour of what's possible when using code to analyse data in a single dataframe, with the Star Wars characters dataset as an example. For a more thorough introduction to data analysis, see the next chapter.

This chapter uses the **pandas** package, along with **pyarrow** to read the feather-format data file; if you don't already have them installed, you may need to run

```shell
conda install pandas
conda install pyarrow
```

on your computer's command line.

## Loading data and checking datatypes

Loading data into a dataframe is achieved with commands like `df = pd.read_csv(...)` or `df = pd.read_stata(...)`. Let's load the Star Wars data, which is stored as a feather file and so is read with `pd.read_feather`:

In [None]:
import pandas as pd
import os
import numpy as np

# Set seed for reproducibility
np.random.seed(10)
# Set max rows displayed for readability
pd.set_option('display.max_rows', 6)


# Read in data
df = pd.read_feather(os.path.join('data', 'starwars.feather'))
# Check info about dataframe
df.info()
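
The same load-then-inspect pattern applies to other file formats. As a minimal sketch, the snippet below reads CSV text from an in-memory string (a hypothetical stand-in for a file on disk, with a few hand-typed rows) and checks the datatypes that pandas inferred:

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file on disk
csv_text = """name,height,mass
Luke Skywalker,172,77.0
C-3PO,167,75.0
Yoda,66,17.0
"""

# read_csv accepts a file path or any file-like object
toy = pd.read_csv(io.StringIO(csv_text))

# .dtypes lists the datatype pandas inferred for each column
print(toy.dtypes)
```

Whole numbers are read as integers and numbers with a decimal point as floats, so it is worth checking the datatypes before doing any arithmetic.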

### Look at the first few rows with `head()`

In [None]:
df.head()

## Filter rows and columns with conditions using `df.loc[condition(s), column(s)]`

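`.loc` stands for location. The first argument selects rows: by passing one or more conditions, you can work with just the subset of rows for which they are true. Wrap each condition in parentheses and combine them with `&` (and) or `|` (or). The second argument selects the columns of interest (use `:` for all columns).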


In [None]:
df.loc[(df['skin_color'] == 'light') & (df['eye_color'] == 'brown'), ['name', 'species']]
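
Conditions can also be combined with `|` (or) and negated with `~` (not). The sketch below uses a small made-up dataframe (`toy` and its values are purely illustrative) so that the result is easy to check by eye:

```python
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'eye_color': ['brown', 'blue', 'brown', 'yellow'],
    'height': [150, 180, 175, 200],
})

# Rows with brown eyes OR height above 190; ':' keeps all columns
either = toy.loc[(toy['eye_color'] == 'brown') | (toy['height'] > 190), :]

# Rows that do NOT have brown eyes; keep only the 'name' column
not_brown = toy.loc[~(toy['eye_color'] == 'brown'), ['name']]

print(either)
print(not_brown)
```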

## Sort rows or columns with `.sort_values()`

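`sort_values(columns)` sorts in ascending order by default; use `sort_values(columns, ascending=False)` for descending order. When given a list of columns, rows are sorted by the first column, with ties broken by the next.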

In [None]:
df.sort_values(['height', 'mass'])
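
The sort direction can also be set per column by passing a list to `ascending`. A minimal sketch on made-up data (the `toy` dataframe is hypothetical):

```python
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'height': [180, 150, 180, 165],
    'mass': [80.0, 45.0, 70.0, 60.0],
})

# Tallest first; among rows of equal height, lightest first
ordered = toy.sort_values(['height', 'mass'], ascending=[False, True])
print(ordered)
```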

## Choose multiple rows or columns using slices

Slices can be passed by name using `.loc[startrow:stoprow:step, startcolumn:stopcolumn:step]` or by position using `.iloc[start:stop:step, start:stop:step]`.

Choosing every 10th row, and the columns between 'name' and 'birth_year':

In [None]:
df.loc[::10, 'name':'birth_year']

Choosing the first 5 rows and the last 2 columns by position:

In [None]:
df.iloc[:5, -2:]
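
One difference between the two is easy to miss: slices by name with `.loc` include the stop label, while slices by position with `.iloc` exclude the stop position, as in ordinary Python slicing. A small sketch on a made-up dataframe:

```python
import pandas as pd

# Made-up data with the default integer index 0, 1, 2, 3
toy = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [5, 6, 7, 8],
    'c': [9, 10, 11, 12],
})

# By name: rows labelled 0, 1 AND 2; columns 'a' AND 'b'
by_name = toy.loc[0:2, 'a':'b']

# By position: rows 0 and 1 only; column 0 only
by_position = toy.iloc[0:2, 0:1]

print(by_name.shape, by_position.shape)
```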

## Randomly selecting a sample using `.sample`

`.sample(n)` randomly selects `n` rows, `.sample(frac=0.4)` selects 40% of the rows, `replace=True` samples with replacement, and passing `weights=` selects a number or fraction of rows with probabilities given by the passed weights.

Taking a sample of 5 rows:

In [None]:
df.sample(5)
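
The other options can be sketched on a small made-up dataframe. Here `random_state` is passed so that each draw is repeatable, and the hypothetical `weight` column puts all of the probability on a single row:

```python
import pandas as pd

# Made-up data; only row 'C' has a non-zero sampling weight
toy = pd.DataFrame({
    'name': list('ABCDE'),
    'weight': [0, 0, 1, 0, 0],
})

# 40% of 5 rows is 2 rows
two_rows = toy.sample(frac=0.4, random_state=0)

# Sampling with replacement allows more draws than there are rows
resampled = toy.sample(n=8, replace=True, random_state=0)

# Weights can be given as the name of a column in the dataframe
weighted = toy.sample(n=1, weights='weight', random_state=0)

print(weighted)
```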

## Rename with `.rename`

You can rename all columns by passing a function, for instance `df.rename(columns=str.lower)` to put all columns in lower case. Alternatively, use a dictionary to say which columns should be mapped to what:

In [None]:
df.rename(columns={'homeworld': 'home_world'})
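
As a sketch of the function-based approach, the example below renames the columns of a made-up dataframe (the column names are hypothetical) using first `str.lower` and then a lambda function. Note that `.rename` returns a new dataframe and leaves the original untouched:

```python
import pandas as pd

# Made-up dataframe with untidy column names
toy = pd.DataFrame({'Name': ['A'], 'Home World': ['X']})

# Apply a function to every column name
lowered = toy.rename(columns=str.lower)

# Any function of a single column name works, including a lambda
snake = toy.rename(columns=lambda c: c.lower().replace(' ', '_'))

print(list(lowered.columns))
print(list(snake.columns))
```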

## Add new columns with `.assign` or assignment

Very often you will want to create new columns based on existing columns.

![](https://pandas.pydata.org/docs/_images/05_newcolumn_1.svg)

`.assign` is a method that applies to a dataframe and returns a copy of that dataframe with the new column(s) added, leaving the original unchanged:

In [None]:
df.assign(height_m = df['height']/100)

The new column was added at the end; ideally, we'd like it next to the height column. Sorting the columns alphabetically with `sort_index(axis=1)` achieves that here:

In [None]:
(df.assign(height_m = df['height']/100)
   .sort_index(axis=1))

To overwrite an existing column, reuse its name in the call to `assign`, for example `height = df['height']/100`. Instead of chaining methods together, we can also assign using a series of separate statements:

```python
df['height_m'] = df['height']/100
df = df.sort_index(axis=1)
```
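
`.assign` also accepts functions of the dataframe, which is handy when chaining because later columns can refer to ones created or overwritten earlier in the same call. The sketch below uses made-up data, and the `bmi` column is a hypothetical example of a derived variable:

```python
import pandas as pd

# Made-up data: heights in centimetres, masses in kilograms
toy = pd.DataFrame({'height': [172.0, 167.0], 'mass': [77.0, 75.0]})

out = toy.assign(
    # Overwrite height, converting centimetres to metres
    height=lambda d: d['height'] / 100,
    # This lambda sees the height column that was just overwritten
    bmi=lambda d: d['mass'] / d['height'] ** 2,
)

print(out)
```

The original `toy` dataframe still holds heights in centimetres afterwards, because `.assign` works on a copy.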

## Summarise numerical values with `.describe()`



In [None]:
df.describe()
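
By default, `.describe()` reports the count, mean, standard deviation, minimum, quartiles, and maximum of each numerical column, and skips non-numerical columns. A minimal sketch on made-up data:

```python
import pandas as pd

# Made-up data with one numerical and one text column
toy = pd.DataFrame({
    'height': [150.0, 160.0, 170.0, 180.0],
    'name': list('ABCD'),
})

summary = toy.describe()

# The rows of the summary are the statistics; the columns are the variables
print(summary)
```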

## Group variable values with `.groupby()`



In [None]:
df.groupby('species')[['height', 'mass']].mean()
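
Several summary statistics can be computed per group in one go by passing a list of function names to `.agg()`. A small sketch, using made-up species and heights:

```python
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({
    'species': ['Human', 'Human', 'Droid', 'Droid'],
    'height': [170.0, 180.0, 100.0, 160.0],
})

# One row per species, one column per statistic
stats = toy.groupby('species')['height'].agg(['mean', 'max', 'count'])
print(stats)
```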

## Add transformed columns using `.transform()`

Quite often, it's useful to put a column into a dataframe that is the result of an intermediate groupby and aggregation, for example to subtract the group mean or to normalise within groups. `.transform()` does this and returns a column with the same number of rows as the original dataframe, rather than one row per group.

Below is an example of `.transform()` being used to demean a variable by subtracting its mean within each species. Note that we use `pd.notna` to filter out missing species values (otherwise this would result in an error, which is actually *helpful* behaviour) and that we are using lambda functions. Lambda functions are a quick way of writing functions, e.g. `lambda x: x+1` defines a function that adds one to `x`.

In [None]:
# Keep rows with a known species, then subtract each species' mean height
(df[pd.notna(df['species'])]
 .assign(height_demean_species = df[pd.notna(df['species'])]
                                 .groupby('species')['height']
                                 .transform(lambda x: x-x.mean()))
)
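
To see how this differs from an ordinary aggregation, the sketch below compares the two on a small made-up dataframe: the aggregation returns one value per group, while `.transform()` returns one value per original row.

```python
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({
    'species': ['Human', 'Human', 'Droid', 'Droid'],
    'height': [170.0, 180.0, 100.0, 160.0],
})

# Aggregation: one row per species
group_means = toy.groupby('species')['height'].mean()

# Transform: one row per original row, aligned with toy's index
demeaned = toy.groupby('species')['height'].transform(lambda x: x - x.mean())

print(len(group_means), len(demeaned))
```

Because the result is aligned with the original index, it can be assigned straight back as a new column.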

## Make quick charts with `.plot.*`

The available chart types include scatter, area, bar, box, density, hexbin, hist (histogram), kde, and line.

In [None]:
df.plot.scatter('mass', 'height', alpha=0.5);

In [None]:
df[['species', 'height']].plot.box();

In [None]:
df['height'].plot.kde();