# Read in the Data

In [None]:
import numpy as np
import pandas as pd

In [None]:
pop_df = pd.read_csv("./data/populations.txt", sep='\t')
pop_df.head(5)

In [None]:
pop_df.shape

## YOUR TURN HERE

Let us investigate the dataset by checking the number of variables/features and observations.

**Number of variables/features = number of columns in DF**

**Enter value of X below**

We have X variables in the dataset.

**Number of observations = number of rows in DF**

**Enter value of Y below**

We have Y observations in the dataset.
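
As a hedged illustration of how to read these two numbers off a dataframe, here is a minimal sketch on a made-up frame (`toy_df` and its values are hypothetical and are not the populations data). The `shape` attribute is a `(rows, columns)` tuple that can be unpacked directly:

```python
import pandas as pd

# Hypothetical toy data, not the populations dataset
toy_df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# shape is a (number of rows, number of columns) tuple
n_obs, n_vars = toy_df.shape
print("observations:", n_obs, "variables:", n_vars)
```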

Let us check the names of the variables embedded in the dataset. Note that sometimes we do not have column names (variable names) in the dataset.

In [None]:
pop_df.columns
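
When a file has no header row, one option is to tell `read_csv` so and supply the names yourself. The sketch below assumes a small made-up tab-separated string (`raw`) and hypothetical column names; it is not how the populations file itself is laid out:

```python
import io
import pandas as pd

# Hypothetical tab-separated text with no header row
raw = "1900\t30000\t4000\n1901\t47200\t6100\n"

# header=None tells pandas the first line is data; names= supplies the labels
no_header_df = pd.read_csv(
    io.StringIO(raw), sep="\t", header=None, names=["year", "hare", "lynx"]
)
print(no_header_df.columns.tolist())
```

Without `names=`, pandas would label the columns with the integers 0, 1, 2.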

Models usually need only the underlying values, which we can access as a NumPy array this way:

In [None]:
pop_df.values

Data type is also an important characteristic of the data. We can access the data types this way:

In [None]:
pop_df.dtypes

We can access columns (Pandas series) using their labels:

In [None]:
hare_df = pop_df["hare"]
hare_df

Or alternatively using the label as a property of the dataframe:

In [None]:
pop_df.hare
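
Attribute-style access is convenient, but it only works when the label is a valid Python identifier and does not collide with an existing dataframe attribute. A minimal sketch of the pitfall, using a made-up frame with hypothetical column names:

```python
import pandas as pd

# Hypothetical frame: one label contains a space, one collides with a method
toy_df = pd.DataFrame({"hare count": [1, 2], "mean": [3, 4]})

# Bracket access works for any label
by_label = toy_df["hare count"]

# Attribute access finds the DataFrame.mean method, not the "mean" column
method_not_column = toy_df.mean
column = toy_df["mean"]

print(callable(method_not_column))
print(column.tolist())
```

For this reason bracket access is the safer default in reusable code.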

# Data Exploration

Data exploration is easier with Pandas.

The usual numeric operations are available for dataframes or series:

In [None]:
print ("Mean Hare Population: ", hare_df.mean())

In [None]:
print ("Mean Populations: \n", pop_df[["hare","lynx","carrot"]].mean())
print ("\n")
print ("Standard Deviations: \n", pop_df[["hare","lynx","carrot"]].std())

The describe() method provides a detailed description of variables:

In [None]:
pop_df[["hare","lynx","carrot"]].describe()

In [None]:
pop_df.describe()

The corr() method computes the pairwise correlations between variables in one call:

In [None]:
pop_df[["hare","lynx","carrot"]].corr()
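
By default `corr()` reports the Pearson correlation coefficient. To make that concrete, the sketch below computes the coefficient by hand and compares it with pandas on made-up numbers (`toy_df`, `x` and `y` are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a nearly linear relationship
toy_df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [2.0, 4.1, 5.9, 8.2]})

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)), with centered values
dx = toy_df["x"] - toy_df["x"].mean()
dy = toy_df["y"] - toy_df["y"].mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The same quantity from pandas
r_pandas = toy_df["x"].corr(toy_df["y"])
print(r_manual, r_pandas)
```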

Sorting is also easy:

In [None]:
pop_df.sort_values(by=['hare'])
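
Two details are worth knowing: `sort_values` returns a new dataframe rather than changing the original, and `ascending=False` reverses the order. A small sketch on made-up values (the frame below is hypothetical):

```python
import pandas as pd

# Hypothetical toy data
toy_df = pd.DataFrame({"year": [1900, 1901, 1902], "hare": [30, 47, 20]})

# Largest hare value first; toy_df itself is left untouched
sorted_desc = toy_df.sort_values(by="hare", ascending=False)
print(sorted_desc)
```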

More examples of accessing and manipulating data in dataframes:

In [None]:
# finding all instances when the population of hares is above 50k
hare_above_50K = pop_df.hare>50000
print (hare_above_50K)
print ("\n")
print (pop_df[hare_above_50K])
print ("\n")
print (pop_df[hare_above_50K].year)

In [None]:
# finding all instances when the population of at least one of the animal species is above 50k
above_50K = (pop_df["hare"]>50000) | (pop_df["lynx"]>50000)
print (pop_df[above_50K])
#print (pop_df[above_50K].year)
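
Boolean masks can also be combined with `&` (and) and negated with `~` (not). Each comparison needs its own parentheses because these operators bind more tightly than `>`. A minimal sketch on a made-up frame (values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data
toy_df = pd.DataFrame(
    {"hare": [30000, 60000, 70000], "lynx": [4000, 55000, 9000]}
)

# Rows where both species exceed 50k
both_above = (toy_df["hare"] > 50000) & (toy_df["lynx"] > 50000)

# Rows where hares do NOT exceed 50k
hare_not_above = ~(toy_df["hare"] > 50000)

print(toy_df[both_above])
print(toy_df[hare_not_above])
```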

We know that the *year* column is only an identifier, so we may not need it in the analysis.

In [None]:
pop2 = pop_df.drop("year", axis=1)
pop2

When necessary, we can convert a dataframe (or a series) into a NumPy array:

In [None]:
poptable = np.array(pop2)
poptable
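
An equivalent route is the `to_numpy()` method on the dataframe itself. The sketch below, on a made-up frame, only checks that both routes give the same array; which one to use is largely a matter of style:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
toy_df = pd.DataFrame({"hare": [1, 2], "lynx": [3, 4]})

arr_from_numpy = np.array(toy_df)      # conversion via NumPy
arr_from_pandas = toy_df.to_numpy()    # conversion via the pandas method

print(arr_from_pandas.shape)
print(np.array_equal(arr_from_numpy, arr_from_pandas))
```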

# Data Visualization

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')

In [None]:
plt.plot(pop_df["year"], pop_df["hare"])

We can also visualize multiple variables/features in one figure.

But make sure that:
- All data visualized in the same figure are of the same data type (e.g. you cannot mix continuous and categorical data types in the same figure);
- The visualization is not too busy. The figure below is a good example, and it is **highly discouraged** to include more than **5** variables/features in the same figure.

In [None]:
plt.plot(pop_df["year"], pop2)
plt.legend( ('Hares','Lynxes','Carrots') )
plt.ylabel('Population')
plt.xlabel('Year')
plt.show()

Line charts work well for **continuous** variables (particularly time series like the ones we have here), but they are not useful for **categorical** variables, which are usually shown as bar charts of category counts.

To see how the values of a single variable are distributed, we can group them into bins and draw a **histogram**, as below for the carrot variable.

In [None]:
plt.hist(pop_df["carrot"], bins=8, alpha=0.5)
plt.xlabel('Carrots')
plt.ylabel('Count')
plt.title('Histogram of Carrot Populations')
plt.axis([36000, 49000, 0, 6])
#plt.grid(True)
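
To make explicit what a histogram counts, here is a minimal sketch using `np.histogram` on made-up numbers (the `values` array is hypothetical). It returns the count in each bin together with the bin edges, which is the same information the plot draws as bars:

```python
import numpy as np

# Hypothetical toy values
values = np.array([1, 2, 2, 3, 3, 3, 4, 9])

# Four equal-width bins between the minimum and the maximum
counts, edges = np.histogram(values, bins=4)

print(counts)  # number of values falling into each bin
print(edges)   # bin boundaries: one more edge than there are bins
```

Each bin includes its left edge and excludes its right edge, except the last bin, which includes both.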

Pandas has its own versatile "plot" method that can handle most types of charts:

In [None]:
pop_df.plot(x="year", title="Populations")

When we want to investigate the relationship between two variables, we can use a **scatterplot** as follows.

In [None]:
pop_df.plot(x="carrot", y="lynx", kind="scatter")

## YOUR TURN HERE

Q: Can you explain the relationship between 'lynx' and 'carrot'? Do they have any linear relationship?

A: Double click this cell and enter your answer here.

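A boxplot is another tool for investigating the distribution of continuous variables. In a boxplot (with the default settings):
- the box spans the **interquartile range**, from the first quartile (25th percentile) to the third quartile (75th percentile) of the variable;
- the whiskers (above and below the box) extend to the most extreme values lying within 1.5 times the interquartile range of the box edges, and points beyond them are drawn individually as outliers, so the whiskers show the **actual range (min. to max.)** only when there are no outliers;
- the line in the box is the **median** of the variable.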

You can use the boxplot to judge the shape of a distribution, much as you would from a histogram: a median that sits off-center in the box, or one whisker much longer than the other, indicates skew. For instance, in the chart below, both 'hare' and 'lynx' are right-skewed, while 'carrot' is roughly symmetric (but narrow). A boxplot can suggest symmetry, but it cannot by itself confirm a normal distribution.

In [None]:
pop_df.boxplot(column=["hare","lynx","carrot"], return_type='axes')
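
The quantities a default Matplotlib boxplot draws can also be computed by hand. The sketch below does so on made-up numbers (the `values` array is hypothetical) and assumes the default whisker rule of 1.5 times the interquartile range:

```python
import numpy as np

# Hypothetical toy values with one large value at the top
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 30.0])

# Box edges and the line inside the box
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Whiskers reach the most extreme points inside these fences
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
whisker_low = values[values >= lower_fence].min()
whisker_high = values[values <= upper_fence].max()

# Anything outside the fences is drawn as an individual outlier point
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3)
print(whisker_low, whisker_high, outliers)
```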

In [None]:
# generate a random fox population, one value per row of pop_df
fox_col = np.random.randint(low=5000, high=20000, size=len(pop_df))
fox_col

In [None]:
pop_df["fox"] = pd.Series(fox_col, index=pop_df.index)
pop_df
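
The random fox column above changes on every run. If reproducible numbers are wanted, one option is a seeded generator. This is a sketch of an alternative, not what the cells above do; the seed value 42 and the variable names are arbitrary choices for illustration:

```python
import numpy as np

# Two generators with the same (arbitrary) seed produce the same sequence
rng_a = np.random.default_rng(seed=42)
rng_b = np.random.default_rng(seed=42)

# integers() draws from [low, high), like np.random.randint
fox_a = rng_a.integers(low=5000, high=20000, size=21)
fox_b = rng_b.integers(low=5000, high=20000, size=21)

print(np.array_equal(fox_a, fox_b))
```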

In [None]:
pop_df.plot(x="year", y="fox", kind="area", title="Fox Population")

In [None]:
pd.plotting.scatter_matrix(pop_df[["hare","lynx","carrot"]], figsize=(14,14), hist_kwds={'bins':8}, alpha=.5, marker='o', s=50)