# pandas Introduction

The pandas package is included with the Anaconda distribution and is one of the most widely used Python packages in data science.  It must be imported before use, and by convention it is imported under the alias pd.

In [None]:
import pandas as pd

The two main data structures in pandas are **Series** and **DataFrames**.  They have similar functionality; the main difference is that a Series has only one column of data, while a DataFrame can have multiple columns.  As with numpy ndarrays, each data column has a single data type (e.g., int32, float64, object).
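As a rough illustration of the difference, the sketch below builds one Series and one small DataFrame and prints their data types.  The variable names and values are made up for this example and are not part of the notebook's data.

```python
import pandas as pd

# Illustrative data only: one Series and one two-column DataFrame.
s = pd.Series([1.5, 2.5, 3.5])
df = pd.DataFrame({'count': [1, 2, 3], 'label': ['a', 'b', 'c']})

print(s.dtype)    # a Series has a single data type
print(df.dtypes)  # a DataFrame reports one data type per column
print(df.shape)   # (number of rows, number of columns)
```

Each column of the DataFrame behaves much like a Series of its own, which is why the two structures share so many methods.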

## A Series with Categorical Data

pd.Series performs a data type conversion; in the case below, it converts a Python list to a pandas Series.

In [None]:
ic = ['chocolate', 'strawberry', 'vanilla', 'rum raisin', 'chocolate', 'vanilla', 'vanilla', 'strawberry', 'rum raisin', 'chocolate', 'strawberry', 'cotton candy', 'chocolate', 'vanilla', 'rum raisin', 'vanilla', 'vanilla', 'strawberry', 'chocolate', 'vanilla', 'chocolate', 'vanilla', 'strawberry', 'vanilla', 'chocolate', 'chocolate', 'purple cow', 'chocolate', 'rum raisin', 'vanilla', 'chocolate', 'bubble gum', 'vanilla']
ic = pd.Series(ic)
ic

In [None]:
type(ic)

In [None]:
ic.head()

In [None]:
ic.head(n=3)

## Series Structure

Series have one data column and an index "column."  The index is generated automatically when you create the Series, although you can change it manually.  It is meant primarily for accessing data in the values column and, perhaps, for labeling the data in plots.  The data in the "data" column are accessible with the .values property and the indices are available with the .index property.

In [None]:
ic.values

In [None]:
type(ic.values)

In [None]:
ic.index

In [None]:
ic.size
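The index does not have to be the default 0, 1, 2, ... sequence.  The sketch below, which uses hypothetical scoop counts rather than the notebook's data, supplies the labels when the Series is created and then looks up a value by its label.

```python
import pandas as pd

# Hypothetical scoop counts, labeled by flavor instead of by 0, 1, 2, ...
scoops = pd.Series([12, 7, 9], index=['chocolate', 'strawberry', 'vanilla'])

print(scoops.index)       # the labels
print(scoops.values)      # the underlying numpy array
print(scoops['vanilla'])  # look up a value by its label
```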

You can name your Series, if you wish.  The name will be shown when you print the Series.

In [None]:
ic.name = 'Flavor'

In [None]:
ic.name

Collapse or expand the listing of the Series by clicking in the left margin.

In [None]:
ic

## Operations with Series

A very useful method for analyzing categorical data, and for plotting it in a bar chart, is the .value_counts() method.  It gives the number of observations found for each distinct value in the data column, sorted from most to least frequent.

In [None]:
ic.value_counts()

In [None]:
type(ic.value_counts())

In [None]:
ic.value_counts().index

In [None]:
ic.value_counts().values

In [None]:
type(ic.value_counts().values)

In [None]:
ic.unique()
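If proportions are more useful than raw counts, .value_counts() accepts a normalize parameter.  The sketch below uses a short, made-up list of flavors so the results are easy to check by hand.

```python
import pandas as pd

# A short, made-up list of flavors.
flavors = pd.Series(['vanilla', 'chocolate', 'vanilla',
                     'strawberry', 'vanilla', 'chocolate'])

counts = flavors.value_counts()                # counts, largest first
shares = flavors.value_counts(normalize=True)  # fractions that sum to 1

print(counts)
print(shares)
```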

## pandas Makes Graphing Easy

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

In [None]:
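counts = ic.value_counts()               # count each flavor once and reuse the result
fig, ax = plt.subplots(figsize=(10, 7))  # set the figure size when creating it
ax.bar(counts.index, counts.values)      # one bar per flavor
plt.show()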

## A Series with Numerical Data

As in the example above, we create a list and convert it into a pandas Series.

These data represent temperatures in degrees Fahrenheit.

In [None]:
temp = [65.5, 37.8, 98.3, 84.8, 52.8, 15.8, 37.4, 72.6, 37.2, 28.5, 74.2, 50.1, 12.7, 79.3, 20.0, 47.6, 96.2, 58.5, 95.8, 34.7, 67.1, 78.8, 13.1, 48.0, 23.1, 69.8, 91.3, 11.3, 16.1, 4.4, 92.2, 82.7, 53.2, 71.1, 93.2, 33.0, 39.7, 16.6, 74.5, 22.2, 80.9, 21.6, 87.3, 2.6, 8.1, 46.6, 24.5, 84.4, 14.3, 69.7]

In [None]:
temp

In [None]:
temp = pd.Series(temp)

In [None]:
type(temp)

## Series Structure

Series have one data column and an index "column."

In [None]:
temp

In [None]:
temp.head()

In [None]:
temp.index

In [None]:
temp.values

In [None]:
temp.size

## Accessing Series Data

Data rows can be accessed either by their label in the index "column" or by their position in the data column.  The .loc indexer finds rows based on their label, and .iloc finds rows based on their position, that is, the sequence in which the rows are found in the Series.  Initially, the index labels start at 0 and increment by 1 with every subsequent row, so .loc and .iloc return the same rows.

In [None]:
temp.loc[0]

In [None]:
temp.iloc[0]

But, now, let's change the indices!

In [None]:
temp.index = range(51,101,1)

In [None]:
temp.index

The statement below will raise a KeyError since we no longer have an index label of 0.

In [None]:
temp.loc[0]

The first record, now, has a label of 51.

In [None]:
temp.loc[51]

In [None]:
temp.iloc[0]
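The two indexers also differ when slicing: a .loc slice includes both endpoint labels, while an .iloc slice excludes the stop position, as ordinary Python slices do.  The sketch below uses a four-row Series with illustrative labels to show both selecting the same rows.

```python
import pandas as pd

# A small Series whose labels start at 51, as in the relabeled temp Series.
readings = pd.Series([65.5, 37.8, 98.3, 84.8], index=[51, 52, 53, 54])

by_label = readings.loc[52:53]    # label slice: both endpoints are included
by_position = readings.iloc[1:3]  # position slice: the stop position is excluded

print(by_label)
print(by_position)

# Asking .loc for a label that does not exist raises a KeyError.
missing = False
try:
    readings.loc[0]
except KeyError:
    missing = True
print(missing)
```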

## Series Methods for Numerical Data

In [None]:
print('temp.sum(): ',temp.sum())
print('temp.mean(): ',temp.mean())
print('temp.median(): ',temp.median())
print('temp.product(): ',temp.product())
print('\ntemp.describe(): \n',temp.describe())


In [None]:
temp.sort_values()
temp.head()

Oops, the command above did not change the Series.  The .sort_values() method returns a new, sorted Series; the original is not altered unless we set the 'inplace' parameter equal to True.

In [None]:
saveIt = temp.sort_values()
saveIt.head()
saveIt

In [None]:
temp.sort_values(inplace=True)
temp.head()

Set the ascending parameter to False to sort in descending order.

In [None]:
temp.sort_values(inplace=True,ascending=False)
temp.head()

You can use the command below to put the data back in their original order, by index, if your indices are sequential integers.

In [None]:
temp.sort_index(inplace=True)
temp.head()

## Appending One Series to Another

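Again, the changes are not permanent unless you assign the combined Series back to the original Series variable.  Series are combined with pd.concat(); the older Series.append() method was removed in pandas 2.0.

In [None]:
temp_new = pd.Series([0.0])
pd.concat([temp, temp_new])  # returns a new Series; temp itself is unchanged
temp

In [None]:
temp_new = pd.Series([0.0])
temp = pd.concat([temp, temp_new])  # assign the result to keep the added row
print(temp.size, temp.index)
temp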
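When the combined Series should be renumbered rather than keep each piece's original labels, pd.concat() accepts an ignore_index parameter.  The sketch below uses two small, illustrative Series to compare the two behaviors.

```python
import pandas as pd

# Two small, illustrative Series with different index labels.
first = pd.Series([65.5, 37.8], index=[51, 52])
extra = pd.Series([0.0])

kept = pd.concat([first, extra])                           # labels 51, 52, 0
renumbered = pd.concat([first, extra], ignore_index=True)  # labels 0, 1, 2

print(kept.index)
print(renumbered.index)
```

Keeping the original labels makes it possible to remove the added row later by its label; renumbering is convenient when the labels carry no meaning.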

The .drop() method can be used to delete a Series row based on its label.

In [None]:
temp.drop(labels=0, inplace=True)
print(temp.size, temp.index)
temp

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

pandas Series make for easy plotting.  The index range automatically serves as the x-axis labels and the data column is used for the y-axis data.

In [None]:
fig,ax = plt.subplots()
ax.plot(temp)
plt.show()

## How I Constructed the Temperature Data

I created some random values using the list comprehension statement below, formatting the data to have one decimal place.  I then copied the data and pasted it into a definition statement for the temp variable.

In [None]:
import numpy as np
temp = [float(f'{0.0 + 100.0 * np.random.random():.1f}')  for i in range(50)]
temp
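The statement above produces different values on every run.  As an alternative sketch, not the method used to build the data above, a seeded numpy random Generator gives a repeatable list; the seed value 42 and the name temps are arbitrary choices for this example.

```python
import numpy as np

# A seeded generator produces the same sequence on every run.
rng = np.random.default_rng(seed=42)  # the seed value is arbitrary

# 50 values drawn uniformly between 0 and 100, rounded to one decimal place.
temps = rng.uniform(0.0, 100.0, size=50).round(1).tolist()
print(temps[:5])
```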