# Python: Introduction to Data Analysis with Pandas
  

Welcome to this tutorial on data analysis with `pandas`. If you did the Introductory Python tutorial, you'll remember we briefly looked at the `pandas` package as a way of quickly loading a .csv file to extract some data. This tutorial looks at pandas in more depth.

## What is pandas?

Pandas is a package commonly used for data analysis. It simplifies loading data from external sources such as text files and databases, and provides ways of analysing and manipulating the data once it is loaded into your computer. The features provided in pandas automate and simplify many common tasks that would otherwise take many lines of code to write in the basic Python language.

_If you have used R's dataframes before, or the NumPy package in Python, you may find some similarities in the Python `pandas` package. But if not, don't worry because this tutorial doesn't assume any knowledge of NumPy or R, only basic-level Python._

Pandas is a hugely popular, and still growing, Python library. It is used across a range of disciplines, from environmental and climate science through to social science, linguistics and biology, as well as in industry for data analytics, financial trading and many other applications. If you came to the last tutorial, you'll know I'm a fan of these StackOverflow graphs showing usage of programming languages over time. Well, I found another one showing the growth of pandas compared to some other Python software libraries:

(image here)

Pandas is best suited to structured, __labelled__ data: in other words, tabular data that has headings or names associated with each column. The pandas website (pandas.pydata.org) describes its data-handling strengths as:

 - Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
 - Ordered and unordered (not necessarily fixed-frequency) time series data.
 - Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
 - Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure
 
Some other important points to note about pandas are:

 - pandas is __fast__. Python sometimes gets a bad rap for being a bit slow compared to 'compiled' languages such as C and Fortran. But many of the performance-critical internals of pandas are written in C or Cython and build on NumPy, so operations applied to whole columns at once run in compiled code and large datasets can usually be processed quickly.
 - pandas is a dependency of another library called `statsmodels`, making it an important part of the statistical computing ecosystem in Python.
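
To give a feel for the point about speed, here is a minimal sketch comparing an explicit Python loop with the equivalent whole-column ("vectorised") pandas expression. The temperature data and variable names are made up for illustration. Both approaches give the same numbers, but the pandas expression hands the looping over to compiled code.

```python
import pandas as pd

# Hypothetical data: temperatures in degrees Celsius
celsius = pd.Series([12.5, 18.0, 21.3, 9.8])

# Plain-Python approach: convert one value at a time in a loop
fahrenheit_loop = [c * 9 / 5 + 32 for c in celsius]

# pandas approach: one expression applied to the whole Series at once
fahrenheit_vec = celsius * 9 / 5 + 32

print(fahrenheit_vec)
```

On four values the difference is not noticeable; it tends to matter once a column holds many thousands of rows.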
 
## What will be covered in this tutorial

 - Aims. 
 

## Conventions for using pandas
All the examples in this tutorial assume you have installed the pandas library, either through a scientific Python distribution such as Anaconda, or by installing it with a package manager such as pip or conda. If you are writing scripts, it's assumed that you have the import statement at the top of your script like so:

```python
import pandas as pd
```

Every time we use a pandas feature thereafter, we can shorten what we type by just typing `pd`, such as `pd.some_function()`. Try the following to see which version of pandas you are running:

```python
print(pd.__version__)
```

```
0.22.0
```


## Pandas data structures
Pandas has two core data structures used to store data: the _Series_ and the _DataFrame_.

### Series

The Series is a one-dimensional array-like structure designed to hold a single array (or 'column') of data and an associated array of data labels, called an _index_. We can create a Series to experiment with just by passing in a list of data. Let's use numbers in this example:

```python
my_series = pd.Series([4.6, 2.1, -4.0, 3.0])
print(my_series)
```

```
0    4.6
1    2.1
2   -4.0
3    3.0
dtype: float64
```


Note that printing our _Series_ object shows both the values and the index numbers. If we just want the values, we can get them with:


```python
print(my_series.values)
```

```
[ 4.6  2.1 -4.   3. ]
```


And for just the index:
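
A minimal sketch, reusing the same values as `my_series` above: the `index` attribute holds the labels. The second half uses a made-up labelled Series (`temps`, with invented place names) to show that the index does not have to be numeric.

```python
import pandas as pd

my_series = pd.Series([4.6, 2.1, -4.0, 3.0])

# With no labels supplied, pandas numbers the rows 0, 1, 2, ...
print(my_series.index)

# Labels can be supplied explicitly through the index argument
temps = pd.Series([4.6, 2.1, -4.0, 3.0],
                  index=["Leeds", "York", "Oslo", "Bath"])
print(temps.index)

# A value can then be looked up by its label
print(temps["Oslo"])
```

For `my_series`, the default index is a `RangeIndex` running from 0 up to (but not including) 4. For `temps`, the index is the list of labels we passed in.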

For a lot of applications, a plain old Series is probably not a lot of use, but it is the core component of the pandas workhorse, the _DataFrame_, so it's useful to know about.

### DataFrames
The DataFrame represents tabular data, a bit like a spreadsheet. DataFrames are organised into columns (each of which is a _Series_), and each column stores a single data type, such as floating point numbers, strings or boolean values. DataFrames can be indexed by either their row or column names. (They are similar in many ways to R's `data.frame`.)
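
As a minimal sketch of these ideas, a DataFrame can be built from a dictionary that maps column names to lists of values. The column names, row labels and values here are invented for illustration.

```python
import pandas as pd

# Hypothetical data: each dictionary key becomes a column name
df = pd.DataFrame(
    {
        "site": ["A", "B", "C"],
        "rainfall_mm": [12.4, 0.0, 7.9],
        "flooded": [True, False, False],
    },
    index=["mon", "tue", "wed"],  # row labels
)

print(df)

# Each column holds one data type
print(df.dtypes)

# Selecting a column by name gives back a Series
rain = df["rainfall_mm"]
print(type(rain))

# Selecting a row by its label uses .loc
print(df.loc["tue"])
```

Square brackets with a column name pick out a column, while `.loc` with a row label picks out a row, which is what indexing by row or column names looks like in practice.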
