# Introduction to Data Analysis


## Introduction

The ability to extract, clean, and analyse data is one of the core skills any economist needs. Fortunately, the (open source) tools that are available for data analysis have improved enormously in recent years, and working with them can be a delight--even the most badly formatted data can be beaten into shape.

In this chapter, you'll get an introduction to the [**pandas**](https://pandas.pydata.org/) package, the core data manipulation library in Python. The name is derived from 'panel data' but it's suited for any tabular data, and can be used to work with more complex data structures too.

It's worth noting that the focus in this chapter will be *exclusively* on tabular data. But other data types do exist! Pandas may not be the best tool for those.

This chapter is hugely indebted to the fantastic [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/) by Jake Vanderplas.

Remember, if you get stuck with pandas, there is brilliant [documentation](https://pandas.pydata.org/docs/user_guide/index.html) and a fantastic set of [introductory tutorials](https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/index.html) on their website.

If you haven't already, you'll need to download and install **pandas**. Remember, to install a package, you need to open up a terminal and use `pip install pandas`.

## Dataframes and series

Let's start with the absolute basics. The most basic **pandas** object is a dataframe. A DataFrame is a 2-dimensional data structure that can store data of different types (including characters, integers, floating point values, categorical data, even lists) in columns. Here's a dataframe of the *penguins* dataset.


In [None]:
import seaborn as sns
import numpy as np
import pandas as pd
import os

# Set seed for reproducibility
np.random.seed(10)

df = sns.load_dataset('penguins')
df

What just happened? We loaded a pandas dataframe called `df` and showed its contents. You can see the column names in bold, and the index on the left hand side. Just to double check it *is* a pandas dataframe, we can call `type` on it.

In [None]:
type(df)

And if we want a bit more information about what we imported (including the datatypes of the columns):

In [None]:
df.info()

Remember that everything in Python is an object, and our dataframe is no exception. Each dataframe is made up of a set of series that, in a dataframe, become columns: but you can turn a single series into a dataframe too. Let's see a few ways of creating some series from raw data:

In [None]:
# From a list:
s1 = pd.Series([1., 6., 19., 2.])
print(s1)
# From a range
s2 = pd.Series(range(10))
print(s2)
# From a dictionary
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
s3 = pd.Series(population_dict)
print(s3)

Note that in each case there is no column name (because this is a series, not a dataframe), and there *is* an index. The index is automatically created if we don't specify it; in the third example, by passing a dictionary we implicitly asked for the index to be the dictionary keys we supplied.

If you ever need to get the data 'out' of a series or dataframe, you can just use the `values` attribute of the object:

In [None]:
s3.values

If you ever want to turn a series into a dataframe, just call `pd.DataFrame(series)` on it.
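As a minimal sketch of that conversion (the series name `population` and the two entries are illustrative choices, not part of the original example):

```python
import pandas as pd

# A small series with a string index (values are illustrative)
s = pd.Series({'California': 38332521, 'Texas': 26448193}, name='population')

# Wrapping the series in pd.DataFrame gives a one-column dataframe;
# the series name becomes the column name and the index is preserved
frame = pd.DataFrame(s)
print(frame)
print(frame.columns.tolist())
```

If the series has no name, the single column is simply labelled `0`.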

Now let's try creating our own dataframe with more than one column of data using a *dictionary*:

In [None]:
df = pd.DataFrame({'A': 1.,
                   'B': pd.Series(1, index=list(range(4)), dtype='float32'),
                   'C': [3] * 4,
                   'D': pd.Categorical(["test", "train", "test", "train"]),
                   'E': 'foo'})
df

Another way to create dataframes is to pass an array of data (note that `index`, `columns`, and `dtype` are optional--you can just specify the data):

In [None]:
df = pd.DataFrame(data=np.reshape(range(36), (6, 6)),
                  index=['a', 'b', 'c', 'd', 'e', 'f'],
                  columns=['col' + str(i) for i in range(6)],
                  dtype=float)
df

## Accessing and slicing

Now you know how to put data in a dataframe, how do you access the bits of it you need? There are various ways. If you want to access an entire column, the syntax is very simple: `df['columnname']` (you can also use `df.columnname`). To access a particular row, it's `df.loc['rowname']` or `df.iloc[row]` where `row` is the row number.

Let's see these using a dataframe:


In [None]:
# Column:
print(df['col1'])
# Alternative column:
print(df.col1)
# Row
print(df.loc['a'])
# Row by number
print(df.iloc[0])

To access an individual value from within the dataframe, we have two options: pass an index value and a column name to `.loc[rowname, columnname]` or retrieve the value by using its position using `.iloc[row, column]`: 

In [None]:
# Using .loc
print(df.loc['b', 'col1'])
# Using .iloc (row 'b' is at position 1 and 'col1' is at position 1)
print(df.iloc[1, 1])

So often what we really want is a subset of values (as opposed to *all* values or just *one* value). This is where *slicing* comes in. If you've looked at the Basics of Coding chapter, you'll know a bit about slicing and indexing already, but we'll cover the basics here too.

The syntax for slicing is similar to what we've seen already: there are two methods, `.loc` to access items by name and `.iloc` to access them by position. The syntax for the former is `df.loc[start:stop:step, start:stop:step]`, where the first position is index name and the second is column name (and the same applies for numbers and `df.iloc`). Let's see an example using the dataframe we created above.

In [None]:
df.loc['a':'f':2, 'col1':'col3']

As you can see, slicing even works on names! By asking for rows `'a':'f':2`, we get every other row from 'a' to 'f' (inclusive). Likewise, for columns, we asked for every column between `col1` and `col3` (inclusive). `iloc` works in a very similar way.

In [None]:
df.iloc[1:, :-1]

In this case, we asked for everything from row 1 onwards, and everything up to (but excluding) the last column.
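The difference in how the two methods treat the end of a slice is easy to trip over, so here is a small sketch on a made-up dataframe (the name `small` and its contents are assumptions for illustration). Label-based slices include the end label, while position-based slices exclude the end position:

```python
import numpy as np
import pandas as pd

# Small illustrative dataframe with labelled rows and columns
small = pd.DataFrame(np.arange(12).reshape(4, 3),
                     index=['a', 'b', 'c', 'd'],
                     columns=['col0', 'col1', 'col2'])

# Label-based slicing includes the end labels 'c' and 'col1'
by_label = small.loc['a':'c', 'col0':'col1']
# Position-based slicing excludes the end positions 2 and 1
by_position = small.iloc[0:2, 0:1]

print(by_label.shape)     # three rows, two columns
print(by_position.shape)  # two rows, one column
```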

It's not just strings and positions that can be sliced, though. Here's an example using *dates* (pandas' support for dates is truly excellent):

In [None]:
index = pd.date_range('1/1/2000', periods=16, freq='Q')
df = pd.DataFrame(np.random.randint(0, 10, (16, 5)), index=index, columns=list('ABCDE'))
df

Now let's do some slicing!

In [None]:
df.loc['2000-01-01':'2003-01-01', :]

Two important points to note here: first, pandas doesn't mind that we supplied a date that didn't actually exist in the index. It worked out that by '2000-01-01' we meant a datetime and compared the values of the index to that datetime in order to decide what rows to return from the dataframe. The second thing to notice is the use of `:` for the column names; this explicitly says 'give me all the columns'.
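A short sketch of the same idea on a smaller, made-up series may help. The quarter-start frequency `'QS'` and the column name `value` are assumptions chosen for illustration:

```python
import numpy as np
import pandas as pd

# Quarterly start-of-period dates, two years' worth
idx = pd.date_range('2000-01-01', periods=8, freq='QS')
ts = pd.DataFrame({'value': np.arange(8)}, index=idx)

# A year string selects every row that falls in that year
year_2001 = ts.loc['2001']
# Slice endpoints need not be in the index; both ends are inclusive
part_2000 = ts.loc['2000-02-15':'2000-10-01']

print(year_2001)
print(part_2000)
```

The first selection returns the four quarters of 2001; the second returns the rows dated from April to October 2000, even though 15 February is not itself in the index.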

## Operations on dataframes

Columns in dataframes can undergo all the usual arithmetic operations you'd expect of addition, multiplication, division, and so on. If the underlying datatypes of two columns have a group operation, then the dataframe columns will use that. The results of these manipulations can just be saved as a new series, eg, `new_series = df['A'] + df['B']` or created as a new column of the dataframe:

In [None]:
df['new_col'] = df['A']*(df['B']**2) + 1
df

Boolean variables and strings have group operations (eg concatenation is via `+` with strings), and so work well with column operations too:

In [None]:
df = pd.DataFrame({'a': [1, 0, 1], 'b': [0, 1, 1], 'c': [0, 1, 1], 'd': [1, 1, 0]}, dtype=bool)
print(df)
print('\n A and C:\n')
print(df['a'] & df['c'])
print('\n B or D:\n')
print(df['b'] | df['d'])
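The boolean case is shown above; for strings, a minimal sketch might look like the following (the dataframe `people` and its column names are made up for this example):

```python
import pandas as pd

# Illustrative data: the column names and values are invented for this sketch
people = pd.DataFrame({'first': ['Ada', 'Alan'], 'last': ['Lovelace', 'Turing']})

# `+` concatenates strings element by element, just as it adds numbers
people['full'] = people['first'] + ' ' + people['last']
print(people)
```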

More complex operations on whole dataframes are supported, but if you're doing very heavy lifting you might want to just switch to using numpy arrays (**numpy** is basically Matlab in Python). As examples though, you can transpose and exponentiate easily:

In [None]:
df = pd.DataFrame(np.random.randint(0, 5, (3, 5)), columns=list('ABCDE'))
print('\n Dataframe:')
print(df)
print('\n Exponentiation:')
print(np.exp(df))
print('\n Transpose:')
print(df.T)

## Aggregation

**pandas** has built-in aggregation functions such as

| Aggregation | Description |
| ----------- | ----------- |
| `count()` | Number of items |
| `first()`, `last()` | First and last item |
| `mean()`, `median()` | Mean and median |
| `min()`, `max()` | Minimum and maximum |
| `std()`, `var()` | Standard deviation and variance |
| `mad()` | Mean absolute deviation (removed in pandas 2.0) |
| `prod()` | Product of all items |
| `sum()` | Sum of all items |

These can be applied down the rows, giving one result per column (`axis=0`, the default), or across the columns, giving one result per row (`axis=1`).


In [None]:
df.sum(axis=0)
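To make the role of `axis` concrete, here is a sketch on a tiny made-up dataframe (the name `small` and its values are illustrative):

```python
import pandas as pd

# Tiny illustrative dataframe
small = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})

col_totals = small.sum(axis=0)  # collapse the rows: one total per column
row_totals = small.sum(axis=1)  # collapse the columns: one total per row

print(col_totals)
print(row_totals)
```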

## Split, apply, and combine

Splitting a dataset, applying a function, and combining the results are three key operations that we'll want to use again and again. Splitting means differentiating between rows or columns of data based on some conditions, for instance different categories or different values. Applying means applying a function, for example finding the mean or sum. Combine means putting the results of these operations back into the dataframe, or into a variable. The figure below gives an example.

![](https://jakevdp.github.io/PythonDataScienceHandbook/figures/03.08-split-apply-combine.png)

Note that the 'combine' part doesn't always have to result in a new dataframe; it could create new columns in an existing dataframe.

Let's first see a really simple example of splitting a dataset into groups and finding the mean across those groups using the *penguins* dataset. We'll group the data by island and look at the means. 

In [None]:
df = sns.load_dataset('penguins')
# numeric_only=True skips string columns such as 'species' and 'sex'
df.groupby('island').mean(numeric_only=True)

The aggregations from the previous [part](#aggregation) all work on grouped data. An example is `df.groupby('island')['body_mass_g'].std()` for the standard deviation of body mass by island.
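As a sketch of a grouped aggregation on a single column, here is the pattern applied to a few made-up rows standing in for the penguins data (the dataframe `toy` and its values are assumptions, so that the example runs without downloading anything):

```python
import pandas as pd

# Made-up measurements standing in for the penguins data
toy = pd.DataFrame({'island': ['Biscoe', 'Biscoe', 'Dream', 'Dream'],
                    'body_mass_g': [4000.0, 5000.0, 3500.0, 3700.0]})

# Group first, then select the column, then aggregate
mass_std = toy.groupby('island')['body_mass_g'].std()
print(mass_std)
```

The result is a series indexed by island, with one standard deviation per group.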

You can also pass other functions via the `agg` method (short for aggregation). Here we pass two numpy functions:


In [None]:
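# Restrict to numeric columns: string columns such as 'sex' can't be averaged
df.groupby('species')[['bill_length_mm', 'body_mass_g']].agg([np.mean, np.std])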

Multiple aggregations can also be performed at once on the entire dataframe by using a dictionary to map columns into functions. You can also group by as many variables as you like by passing the groupby method a list of variables. Here's an example that combines both of these features:


In [None]:
df.groupby(['species', 'island']).agg({'body_mass_g': 'sum', 'bill_length_mm': 'mean'})

Sometimes, inheriting the column names becomes problematic. There's a slightly fussy syntax to help with that:


In [None]:
df.groupby(['species', 'island']).agg(count_bill=('bill_length_mm', 'count'),
                                      mean_bill=('bill_length_mm', 'mean'),
                                      std_flipper=('flipper_length_mm', np.std))

Finally, you should know about the `apply` method, which takes a function and applies it to a given axis (`axis=0` for index, `axis=1` for columns) or column. The simple example below shows how it works, though in practice you'd just use `df['mass_in_kg'] = df['body_mass_g']/1e3` to do this.

In [None]:
def g_to_kg(mass_in_g):
    return mass_in_g/1e3

df['mass_in_kg'] = df['body_mass_g'].apply(g_to_kg)
df.head()

## Filter, transform, apply, and assign

Filtering does exactly what it sounds like, but it can make use of group-by commands. In the example below, all but one species is filtered out.


In [None]:
def filter_func(x):
    return x['bill_length_mm'].mean() > 48

df.groupby('species').filter(filter_func).head()

Transforms return a transformed version of the data that has the same shape as the input. This is useful when creating new columns that depend on some grouped data, for instance creating group-wise means. Here's an example using the datetime group to subtract a yearly mean. First let's create some synthetic data with some values, a datetime index, and some groups:

In [None]:
index = pd.date_range('1/1/2000', periods=12, freq='Q')
data = np.random.randint(0, 10, (12, 2))
df = pd.DataFrame(data, index=index, columns=['values1', 'values2'])
df['type'] = np.random.choice(['group' + str(i) for i in range(3)], 12)
df

Now we take the yearly means by type. `pd.Grouper(freq='A')` is an instruction to group by year (`A` for annual) using the given datetime index; the mean is then taken within each year-type group.

In [None]:
df['v1_demean_yr_type'] = (df.groupby([pd.Grouper(freq='A'), 'type'])['values1']
                             .transform(lambda x: x - x.mean()))
df
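The key property of a transform is that its output lines up row by row with its input, whereas an aggregation returns one row per group. A minimal sketch on made-up data (the dataframe `toy` is an assumption for illustration):

```python
import pandas as pd

# Illustrative data with two groups
toy = pd.DataFrame({'type': ['g0', 'g0', 'g1', 'g1'],
                    'values1': [1.0, 3.0, 10.0, 20.0]})

grouped = toy.groupby('type')['values1']
group_means = grouped.mean()                          # one row per group
demeaned = grouped.transform(lambda x: x - x.mean())  # one row per original row

print(group_means)
print(demeaned)
```

Because `demeaned` shares the index of `toy`, it can be assigned straight back as a new column.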

You'll have seen there's a `lambda` keyword here. Lambda (or anonymous) functions have a rich history in mathematics, and were used by scientists such as Church and Turing to create proofs about what is computable *before electronic computers existed*. They can be used to define compact functions:

In [None]:
multiply_plus_one = lambda x, y: x*y + 1
multiply_plus_one(3, 4)

Both regular functions and lambda functions can be used with the more general apply method, which takes a function and applies it to a given axis (`axis=0` for index, `axis=1` for columns):

In [None]:
df['val1_times_val2'] = df.apply(lambda row: row['values1']*row['values2'], axis=1)
df

Assign is a method that allows you to return a new object with all the original columns in addition to new ones. Existing columns that are re-assigned will be overwritten. This is *really* useful when you want to perform a bunch of operations together in a concise way and keep the original columns. For instance, to demean the 'values1' column by year-type and to recompute the 'val1_times_val2' column using the newly demeaned 'values1' column:

In [None]:
df.assign(values1=(df.groupby([pd.Grouper(freq='A'), 'type'])['values1']
                     .transform(lambda x: x - x.mean())),
          val1_times_val2=lambda x: x['values1']*x['values2'])
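Two features of `assign` are worth seeing in isolation: it leaves the original dataframe untouched, and a lambda passed to it sees any columns assigned earlier in the same call. The sketch below uses a made-up dataframe `toy` to show both:

```python
import pandas as pd

toy = pd.DataFrame({'values1': [1, 2, 3], 'values2': [4, 5, 6]})

# The second lambda receives the dataframe *after* values1 has been overwritten,
# so the product uses the doubled values
result = toy.assign(values1=lambda d: d['values1'] * 2,
                    val1_times_val2=lambda d: d['values1'] * d['values2'])

print(result)
print(toy)  # the original dataframe is unchanged
```

To keep the result, assign it to a name, as in `df = df.assign(...)`.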

## Time, windows, and rolling

The support for datetimes in pandas is really excellent. The time series section of the documentation has more info; here we'll just see a couple of the most important bits. First, let's create some synthetic data to work with:

In [None]:
index = pd.date_range('1/1/2000', periods=24, freq='M')
def recursive_ts(n, x=0.05, beta=0.6, alpha=0.2):
    shock = np.random.normal(loc=0, scale=0.6)
    if(n==0):
        return beta*x + alpha + shock
    else:
        return beta*recursive_ts(n-1, x=x) + alpha + shock

t_series = np.cumsum([recursive_ts(n) for n in range(24)])
df = pd.DataFrame(t_series, index=index, columns=['values'])
df.loc['2000-08-31', 'values'] = np.nan
df

Now let's imagine that there are a number of issues with this time series. First, it's been recorded wrong: it actually refers to the start of the next month, not the end of the previous as recorded; second, there's a missing number we want to interpolate; third, we want to take the difference of it to get to something stationary; fourth, we'd like to add a lagged column. We can do all of those things!


In [None]:
# Shift each month-end date forward one day, to the start of the next month
df.index = df.index + pd.tseries.offsets.DateOffset(days=1)

df['values'] = df['values'].interpolate(method='time')
df['diff_values'] = df['values'].diff(1)
df['lag_diff_values'] = df['diff_values'].shift(1)
df


Another useful function to be aware of is `resample`, which can upsample or downsample time series.
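Here is a minimal sketch of both directions on a made-up monthly series. The frequencies `'MS'` (month start), `'QS'` (quarter start) and `'D'` (daily), and the choice to forward-fill when upsampling, are illustrative assumptions rather than anything prescribed above:

```python
import numpy as np
import pandas as pd

# Twelve month-start observations with values 0.0 to 11.0
idx = pd.date_range('2000-01-01', periods=12, freq='MS')
monthly = pd.Series(np.arange(12, dtype=float), index=idx)

# Downsample: average the three months in each quarter
quarterly = monthly.resample('QS').mean()
# Upsample: move to daily frequency, carrying the last known value forward
daily = monthly.resample('D').ffill()

print(quarterly)
print(daily.head())
```

Downsampling always needs an aggregation (here `mean`), while upsampling needs a rule for filling the new, empty periods (here `ffill`; `interpolate` is another option).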

## Method chaining

Sometimes, rather than splitting data out into multiple lines, it can be more concise and clear to chain methods together. A typical time you might do this is when reading in a dataset and performing all of the initial cleaning. Tom Augspurger has a [great tutorial](https://tomaugspurger.github.io/method-chaining) on this, which I've reproduced parts of here.

To chain methods together, both the input and output must be a pandas dataframe. Many methods already take a dataframe in and return one; for example, `df.rename(columns={'old_col': 'new_col'})` takes in `df` and outputs a dataframe with one column name changed.

But occasionally, we'll want to use a function that we've defined (rather than an already existing one). For that, we need the `pipe` method; it 'pipes' the result of one operation to the next operation. When objects are being passed through multiple functions, this can be much clearer. Compare, for example,

```python
f(g(h(df), g_arg=a), f_arg=b)
```

that is, dataframe `df` is being passed to function `h`, and the results of that are being passed to a function `g` that needs a keyword argument `g_arg`, and the results of *that* are being passed to a function `f` that needs keyword argument `f_arg`. The nested structure is barely readable. Compare this with

```python
(df.pipe(h)
   .pipe(g, g_arg=a)
   .pipe(f, f_arg=b)
)  
```
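To make the comparison concrete, here is a runnable sketch with two hypothetical helper functions, `drop_missing` and `add_scaled`, defined purely for illustration. The piped and nested forms give the same result:

```python
import pandas as pd

# Hypothetical helper functions, defined only for this sketch
def drop_missing(df):
    return df.dropna()

def add_scaled(df, column, factor=1.0):
    return df.assign(scaled=df[column] * factor)

raw = pd.DataFrame({'x': [1.0, None, 3.0]})

# Each step receives the dataframe returned by the previous one
piped = raw.pipe(drop_missing).pipe(add_scaled, 'x', factor=10)
# The equivalent nested call reads inside-out
nested = add_scaled(drop_missing(raw), 'x', factor=10)

print(piped)
```

Extra positional and keyword arguments given to `pipe` are passed through to the function after the dataframe itself.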

Let's see a method chain in action on a real dataset so you get a feel for it.

In [None]:
df = pd.read_csv(os.path.join('data', 'flights1kBTS.csv'), index_col=None)
df.head()

We'll try and do a number of operations in one go: putting column titles in lower case, discarding useless columns, creating precise departure and arrival times, turning some of the variables into categoricals, creating a demeaned delay time, and creating a new categorical column for distances according to quantiles that will be called 'near', 'less near', 'far', and 'furthest'.

In [None]:
def extract_city_name(df):
    '''
    Chicago, IL -> Chicago for origin_city_name and dest_city_name
    '''
    cols = ['origin_city_name', 'dest_city_name']
    # Raw string so that \w reaches the regex engine unescaped
    city = df[cols].apply(lambda x: x.str.extract(r"(.*), \w{2}", expand=False))
    df = df.copy()
    df[cols] = city
    return df

def time_to_datetime(df, columns):
    '''
    Combine all time items into datetimes.

    2014-01-01,0914 -> 2014-01-01 09:14:00
    '''
    df = df.copy()
    def converter(col):
        timepart = (col.astype(str)
                       # NaNs force float dtype; regex=True is needed in pandas 2.0+
                       .str.replace(r'\.0$', '', regex=True)
                       .str.pad(4, fillchar='0'))
        return pd.to_datetime(df['fl_date'] + ' ' +
                               timepart.str.slice(0, 2) + ':' +
                               timepart.str.slice(2, 4),
                               errors='coerce')
    df[columns] = df[columns].apply(converter)
    return df

df = (df
      .drop([x for x in df.columns if 'Unnamed' in x], axis=1)
      .rename(columns=str.lower)
      .pipe(extract_city_name)
      .pipe(time_to_datetime, ['dep_time', 'arr_time'])
      .assign(fl_date=lambda x: pd.to_datetime(x['fl_date']),
              dest=lambda x: pd.Categorical(x['dest']),
              origin=lambda x: pd.Categorical(x['origin']),
              tail_num=lambda x: pd.Categorical(x['tail_num']),
              arr_delay=lambda x: pd.to_numeric(x['arr_delay']),
              op_unique_carrier=lambda x: pd.Categorical(x['op_unique_carrier']),
              arr_delay_demean=lambda x: x['arr_delay'] - x['arr_delay'].mean(),
              distance_group=lambda x: (pd.qcut(x['distance'],
                                                4,
                                                labels=["near", "less near", "far", "furthest"])))
      )
df.head()

## Reading and writing data

![From the pandas documentation](https://pandas.pydata.org/pandas-docs/stable/_images/02_io_readwrite1.svg)

There is a huge range of input and output formats available in **pandas**: Stata (.dta), Excel (.xls, .xlsx), csv (tab or comma or whatever, use the `sep=` keyword), big data formats (HDF5, parquet), JSON, SAS, SPSS, SQL, and more; there's a [full list](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html) of formats available in the documentation. The syntax is usually very similar and of the form `df = pd.read_format('filepath')`, e.g. `df = pd.read_stata('path/to/statafile.dta')` for Stata.
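As a small sketch of the `sep=` keyword, the example below reads a semicolon-separated table from an in-memory string rather than a file on disk; the contents and column names are made up for illustration:

```python
import io
import pandas as pd

# A small semicolon-separated table held in memory (contents are invented)
text = "country;gdp\nA;1.5\nB;2.5\n"

# io.StringIO lets read_csv treat the string as if it were a file
table = pd.read_csv(io.StringIO(text), sep=';')
print(table)
```

For a real file you would pass its path in place of the `io.StringIO` object.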

You may wonder what file format to use. Note that data formats differ in whether they are text based (csv, JSON) or binary (encoded and compressed, like Python's pickle format). The former tend to be easier to use with a range of tools, while the latter are usually faster to read/write and take up less space on disk.

There's a lot to be said for plain old csv because it's interoperable between so many different software tools. Its only trouble is that it's not very efficient with space, and it's not 'intelligent' about column datatypes. If you want a format for intermediate data within a project, I tend to recommend parquet, which scales well to big data (it is very efficient with disk space) and has excellent read and write speeds. Feather is interoperable between Python and R and may also be a good choice. Depending on the structure of your data, you may find JSON a good option too.
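To see what "not intelligent about column datatypes" means in practice, here is a sketch of a csv round trip using an in-memory buffer. The dataframe `original` and its two columns are assumptions made for this illustration:

```python
import io
import pandas as pd

# Illustrative dataframe with a datetime column and a categorical column
original = pd.DataFrame({'date': pd.to_datetime(['2020-01-01', '2020-06-30']),
                         'group': pd.Categorical(['a', 'b'])})

# Write to an in-memory text buffer rather than a file on disk
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)
roundtrip = pd.read_csv(buffer)

print(original.dtypes)
print(roundtrip.dtypes)  # neither the datetime nor the categorical dtype survives
```

With csv you have to restore the types yourself when reading, for example with the `parse_dates` and `dtype` arguments of `pd.read_csv`.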

It's best *not* to use formats associated with proprietary software, especially if the standard may change over time (Stata files change with the version of Stata used!!) or if opening the data in that tool might change it (hello Excel). It's also good practice *not* to use a data storage format that cannot easily be opened by other tools. For this reason, I don't recommend Python's pickle format or R's RDA format.

In [None]:
df.columns

### Variable values

We should take special care with the values of the variables we're interested in: shares, num\_imgs, and num\_videos. Looking at those three in the profiling section, we can see that shares is heavily skewed, with most of the mass around (but greater than) zero and a very small number of larger values. This suggests taking the log of this variable. Secondly, we see the num\_videos parameter has a 95th percentile of 6, a very small number. The distribution is very skewed, with most values being zero. This suggests whether an article contains a video or not might be a better measure of 'videos' than the number of them. Finally, num\_imgs has quite a skewed distribution too, but it's not quite as extreme or centred on zero--so we'll leave this variable as it is.

Let's put those changes through:

In [None]:
import numpy as np
df['log_shares'] = np.log(df['shares'])
df['video'] = df['num_videos']>0
df['video'] = df['video'].astype('category')

### Summary

Reading in, exploring, and cleaning data are usually the hardest parts of (economic) data science. Data can catch you out so easily, and in so many different ways. The example above is very bespoke to the data, and this is typical of all data cleaning and preparation exercises. We didn't even see common operations like merging two different datasets! The best advice I can give is to start experimenting with data cleaning in order to come across some common themes.

Remember:
- understanding your data is most of the battle--running a model on cleaned data is the easy part
- how you read, explore, and clean your data will depend entirely on the question you are trying to answer

## Analysis

We will now try to explain the number of shares ('shares' in the dataset) of an article based on characteristics of the article in the dataset. Specifically, we are interested in whether having rich media, such as images and video, helps increase the shares of articles. We can do that by using ordinary least squares (OLS) to regress shares on the variables representing the amount of rich media content.

Let's start with the simplest model we can think of, which is just regressing the log(shares) on the fixed effects from the weekday and data channel as well as the number of images, whether there is a video, and number of links to other articles. We'll use the **statsmodels** package and its formula API. This lets us use text to specify a model we want to estimate. Putting 'C(variable_name)' into the formula tells statsmodels to treat that variable as a categorical variable, also known as a fixed effect.


In [None]:
import statsmodels.formula.api as smf
model_basic = smf.ols('log_shares ~ C(data_channel)  + C(wkday) + num_imgs + C(video) + num_hrefs + C(quarter)', data=df)
results_basic = model_basic.fit(cov_type='HC1')
print(results_basic.summary())

So it looks like a unit increase in the number of images is associated with a 0.46% increase in the number of shares.
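A quick note on where a percentage interpretation comes from when the outcome is in logs. The sketch below assumes a coefficient of 0.0046 on `num_imgs`, a value inferred from the 0.46% quoted above rather than read from the regression output. For small coefficients, 100 times the coefficient is a close approximation to the exact percentage change:

```python
import numpy as np

# Assumed coefficient on num_imgs, inferred from the 0.46% quoted in the text
beta = 0.0046

approx_pct = 100 * beta               # the usual small-coefficient approximation
exact_pct = 100 * (np.exp(beta) - 1)  # exact percentage change implied by a log outcome

print(round(approx_pct, 3), round(exact_pct, 3))
```

The two numbers only drift apart noticeably once the coefficient gets larger, say above 0.1.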

However, there are a LOT of other variables in this dataset that we haven't used. Omitting them could be influencing the parameters we're seeing. So actually the first thing we should be doing is considering whether we need to include these other variables. As many of them could also have an influence on shares, we probably should--but there are just so many!

The easiest way to think about them is to break them down into similar groups of variables. There are some that count tokens (eg individual words), some looking at sentiment and polarity, and some looking at the title of the article. Then there are a few miscellaneous ones left over (such as url, which we can safely not use in the regression). Let's try and group these using Python's list comprehensions.


In [None]:
token_vars = [x for x in df.columns if 'token' in x]
sentiment_vars = [x for x in df.columns if ((('sentiment' in x) or ('polarity' in x)) and ('title' not in x))]
keyword_vars = [x for x in df.columns if 'kw' in x]
title_vars = [x for x in df.columns if (('title' in x) and (x not in token_vars))]
# Let's look at one of these as an example:
print(', \t'.join(title_vars))

Great, there are now four distinct groups of variables in addition to the ones that were already considered.

We *could* just throw everything into a model (the kitchen sink approach) but some of the variables in the data are likely to be very highly correlated, and multi-collinearity will create issues for our regression. Let's first look at whether any of the variables we haven't already discussed are highly correlated. Just taking the correlation of *all* of the variables will create a huge matrix, so we'll also cut it down to pairs that are highly correlated.


In [None]:
corr_mat = df[token_vars + title_vars + sentiment_vars + keyword_vars].corr()
# Grab any pairs that are problematic
corr_mat[((corr_mat>0.7) | (corr_mat<-0.7)) & (corr_mat != 1)].dropna(how='all', axis=0).dropna(how='all', axis=1).fillna('')

It's clear from this there are quite a few pairs of correlated variables within each group of variables, and we should be cautious about lumping them all in together.

We have many choices at this point but, without going into too much detail, we could either remove some of the independent variables or combine them. In this case, we think that there could still be useful information in the variables so we'd like to keep them but whittle down their information to fewer variables. This sounds like a job for unsupervised machine learning!

### Dimensional reduction

We will make use of the UMAP algorithm to take the sets of variables we've identified, which consist of 25 variables in total, and squish them down to just four dimensions (variables), on the basis that there are probably only really four different bits of information here.

We'll also make use of a scaler, an algorithm that puts the different data on the same scale. This helps the UMAP algorithm perform dimensional reduction more effectively.
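As a rough sketch of what a standard scaler does, the example below uses plain pandas on made-up columns. This is an illustration of the idea under assumed data, not necessarily the scaler or the variables used in this analysis:

```python
import pandas as pd

# Made-up columns on very different scales
features = pd.DataFrame({'n_tokens': [100.0, 200.0, 300.0],
                         'polarity': [0.1, 0.2, 0.3]})

# Standard scaling: subtract each column's mean and divide by its standard deviation
scaled = (features - features.mean()) / features.std(ddof=0)
print(scaled)
```

After scaling, every column has mean zero and unit standard deviation, so no single variable dominates distance calculations simply because of its units.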

This gives us a slightly different value for the impact of the number of images on the percentage of shares of an article: 0.38%, versus 0.46% from earlier. How can we square these? They did use slightly different specifications. In fact, there were many choices of specification we could have made throughout this process. This garden of forking paths is a problem if we want to have confidence in the relationship that we're interested in; the results should not be fragile to small changes in specification.

Fortunately, there are ways to think about this more comprehensively. One trick is to use *specification curve analysis*. This looks at a range of plausible specifications and plots them out. By comparing so many specifications, we get a better idea of whether the preferred specification is a fragile outlier or a robust result.

We'll create a specification curve for the association between the number of images and the number of shares using the [**specification_curve**](https://specification-curve.readthedocs.io/en/latest/readme.html) package (disclaimer: I wrote this package!).

In [None]:
from specification_curve import specification_curve as specy
sc = specy.SpecificationCurve(df, 'log_shares', 'num_imgs',
                              ['umap_0', 'umap_1', 'umap_2', 'umap_3', 'num_hrefs', 'wkday', 'data_channel', 'quarter', 'video'],
                              always_include=['video', 'num_hrefs'])
sc.fit()

In [None]:
sc.plot(preferred_spec=['log_shares', 'num_imgs', 'umap_0', 'umap_1', 'umap_2', 'umap_3',
                        'num_hrefs', 'wkday', 'data_channel', 'quarter', 'video'])

Looking at the specification curve, we can see that most estimates are clustered around the 0.35%--0.50% range *if* the number of links and video fixed effect are both included as regressors. These are both similar to the variable we're interested in, so it seems reasonable to always include them. The preferred specification is right at the lower end of the range, but includes all of the controls.

### Summary



## Presenting results