# 2. Distributions
One of the best ways to describe a variable is to report the values that appear in the dataset and how many times each value appears. This description is called the distribution of the variable.

In [6]:
from collections import Counter
import pandas as pd
import numpy as np
import os

In [14]:
data_path = os.path.join(os.path.dirname(os.getcwd()),"1.2 Exploratory Data Analysis","nsfg_data.csv")
df = pd.read_csv(data_path)

In [15]:
df.head()

Unnamed: 0,caseid,prglngth,outcome,pregordr,birthord,birthwgt_lb,birthwgt_oz,agepreg,finalwgt,totalwgt_lb
0,1.0,39.0,1.0,1.0,1.0,8.0,13.0,0.3316,6448.271112,8.8125
1,1.0,39.0,1.0,2.0,2.0,7.0,14.0,0.3925,6448.271112,7.875
2,2.0,39.0,1.0,1.0,1.0,9.0,2.0,0.1433,12999.542264,9.125
3,2.0,39.0,1.0,2.0,2.0,7.0,0.0,0.1783,12999.542264,7.0
4,2.0,39.0,1.0,3.0,3.0,6.0,3.0,0.1833,12999.542264,6.1875


# 2.1 Frequency Tables
When describing your data, a good first step is to count how often each category of a categorical variable occurs. These counts can be organized in tables known as frequency tables or crosstabs.

Crosstabs can present the frequencies for either a single or multiple categorical variables.

Crosstabs can also display relative frequencies, which express each count as a proportion of the total number of observations.
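As a minimal sketch of the three kinds of table described above, the cell below uses `pd.crosstab` on a small made-up DataFrame. The `toy` frame and its values are hypothetical; only the column names `outcome` and `pregordr` are borrowed from the NSFG data, so the counts here say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical toy data (values invented for illustration only)
toy = pd.DataFrame({
    "outcome":  [1, 1, 2, 1, 4, 1],
    "pregordr": [1, 2, 1, 1, 2, 3],
})

# Frequency table for a single categorical variable
one_way = pd.crosstab(index=toy["outcome"], columns="count")

# Frequency table for two categorical variables
two_way = pd.crosstab(toy["outcome"], toy["pregordr"])

# Relative frequencies: every cell divided by the total number of rows
relative = pd.crosstab(toy["outcome"], toy["pregordr"], normalize="all")

print(one_way)
print(two_way)
print(relative)
```

Passing `normalize="index"` or `normalize="columns"` instead would divide by the row or column totals rather than the grand total.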

# 2.2 Histograms
The most common representation of a distribution is a histogram, which is a graph that shows the frequency of each value. In this context, "frequency" means the number of times the value appears.

In Python, an efficient way to compute frequencies is with a dictionary. Given a sequence of values, `t`:

In [16]:
t = df['outcome']
hist = {}
for x in t:
    # get() returns 0 the first time a value is seen, so no KeyError is raised
    hist[x] = hist.get(x, 0) + 1

In [19]:
hist

{1.0: 9148, 2.0: 1862, 4.0: 1921, 5.0: 190, 3.0: 120, 6.0: 352}

The result of the code above is a dictionary that maps each unique value in `outcome` to that value's frequency.

An alternative approach is to use the `Counter` class from the `collections` module:

In [17]:
counter = Counter(t)

In [18]:
counter

Counter({1.0: 9148, 2.0: 1862, 4.0: 1921, 5.0: 190, 3.0: 120, 6.0: 352})
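A `Counter` is a dictionary subclass, so it supports the same lookups as `hist`, plus a few conveniences. The sketch below rebuilds the counter from the frequencies printed above, rather than from `df`, so that it runs on its own.

```python
from collections import Counter

# Frequencies copied from the output above
counter = Counter({1.0: 9148, 2.0: 1862, 4.0: 1921, 5.0: 190, 3.0: 120, 6.0: 352})

# The two most frequent values, as (value, frequency) pairs
top_two = counter.most_common(2)
print(top_two)

# Total number of observations counted
total = sum(counter.values())
print(total)

# A value that never appeared has frequency 0 instead of raising KeyError
print(counter[7.0])
```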

As we saw in the previous lesson, the pandas `value_counts` method gives the same counts, sorted from most to least frequent:

In [20]:
df['outcome'].value_counts()

1.0    9148
4.0    1921
2.0    1862
6.0     352
5.0     190
3.0     120
Name: outcome, dtype: int64
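To check that the three approaches agree, here is a small sketch that rebuilds a stand-in for `df['outcome']` from the frequencies shown above (an assumption made so the cell is self-contained; it is not the original column). It also shows two `value_counts` variants: `sort_index()` to order by value instead of by frequency, and `normalize=True` to get proportions.

```python
from collections import Counter
import pandas as pd

# Stand-in for df['outcome'], rebuilt from the frequencies printed above
freqs = {1.0: 9148, 2.0: 1862, 4.0: 1921, 5.0: 190, 3.0: 120, 6.0: 352}
t = pd.Series([v for v, n in freqs.items() for _ in range(n)], name="outcome")

# Approach 1: plain dictionary
hist = {}
for x in t:
    hist[x] = hist.get(x, 0) + 1

# Approach 2: Counter; approach 3: value_counts
counter = Counter(t)
counts = t.value_counts()

# All three produce the same value -> frequency mapping
same = hist == dict(counter) == counts.to_dict()
print(same)

# Order by value rather than by frequency
by_value = counts.sort_index()
print(by_value)

# Proportions instead of raw counts
proportions = t.value_counts(normalize=True)
print(proportions)
```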