# Introduction to Pandas

This workshop focuses on the exploration, cleaning and basic visualisation of a prepared set of sample data: the Canberra Climate Sensor Data. If you have not already downloaded this data from [GitHub](https://github.com/resbaz/Sept2017_PandasWorkshop), please do so now, and place the data file in the same folder as this Jupyter notebook.

## Learning objectives

Throughout this session we're going to teach you a range of tools and skills for cleaning, manipulating and visualising large datasets. Using the pandas package, we can read in and manipulate large spreadsheets of data, and matplotlib lets you visualise these datasets in a usable, customisable format.

The three sections I'll be taking you through today are:

- Data examination
- Dataframe manipulation
- Plotting your data with pyplot


## Setting Up

First, we need to import Python's *pandas*, *matplotlib* and *numpy* packages, and then use the inline plotting "magic" command so that all plots generated will appear within this notebook instead of in a new browser tab.

While numpy isn't directly related to this course, it's handy for generating random values, which will be useful when learning how to create your own dataframe.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')  # only `plt` was imported by name, so set the style through it

In [None]:
%matplotlib inline

## A Basic Pandas Introduction

- creating your own dataframe
    - the "series" object
- subsetting columns
- subsetting rows
    - `head()` and `tail()`
    - slicing
    - `loc` vs `iloc`


### Creating a Dataframe

A Pandas dataframe can be thought of as a collection of lists (of equal length), where each list makes up a column inside your dataframe.

The key difference between a list and a Pandas column, though, is that every column has _a single data type_ that applies to all of its values. If you have a column of integers and a single string value, like `[1,2,3,4,'seven']`, pandas can no longer store the column as integers; it falls back to the generic `object` type, which is how it stores strings and mixed values.

Rather than using a plain list, which can hold values of any type, Pandas performs this type coercion with a data type called a Series.

In [None]:
pd.Series([1,2,3,4,5])
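
To see this type coercion for yourself, here is a minimal sketch that builds three small Series from plain lists and prints the data type pandas picks for each. The values and variable names are made up for illustration.

```python
import pandas as pd

# Three toy Series built from plain Python lists (values are made up)
all_ints = pd.Series([1, 2, 3, 4, 5])
ints_and_float = pd.Series([1, 2, 3.5])
ints_and_text = pd.Series([1, 2, 3, 4, 'seven'])

# Each Series has exactly one dtype for the whole column
print(all_ints.dtype)        # an integer dtype
print(ints_and_float.dtype)  # float64: the integers were promoted to floats
print(ints_and_text.dtype)   # object: the generic fallback for mixed values
```

Checking `dtype` like this is a quick way to spot a column that pandas could not read as numbers.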

Each "Series" object can be thought of as its own miniature dataframe. So our previous example would be a dataframe with one column, and 5 rows.

Therefore, when creating a dataframe, you actually have to create it as a collection of these "Series" objects

In [None]:
df = pd.DataFrame(
    {
        'A': ['a', 'b', 'c', 'd', 'e', 'f'],
        'B': [54, 67, 89, 100, None, 64],
        'C': np.random.randn(6),
        'D': [2.6, None, 8.0, 9.4, 3.3, None],
    },
    index=[49, 48, 47, 1, 2, 3],  # Lets you set your row names
)

In [None]:
df

## Working with Data

### Reading in Data

The first step to any data exploration and manipulation is to open your data within your program. We are going to do this using the **pandas** package, which reads in spread-sheet style data and converts them into *dataframes*. 

These dataframes work with rows and columns, like a spreadsheet, except that all data within a single column has to be the same data type. 

For example, imagine you had a spreadsheet containing two columns - "labels" and "numbers", and that the rows in the "labels" column contains either a text or number sequence. Because you cannot turn text into a number, every single row in that "labels" column would need to be a string (text) type. Similarly, if some (but not all) of the rows in the "numbers" column contained decimals, **all** of the rows within this column would need to be of a decimal (float) data type.

To read in a comma-separated file, or \*.csv, you can use the pandas function `read_csv()`.

In [None]:
weather_observations = pd.read_csv('Canberra_observations.csv')

You can also open a variety of other file types using the "reader" functions found in this [IO tools documentation](http://pandas.pydata.org/pandas-docs/version/0.20/io.html "Pandas IO tools"). 

This includes file types such as excel (\*.xlsx) files, and text (\*.txt) files. You can use the parameters within these functions to specify file or data attributes such as column separators, whether there's column/row names, and even specifying your own column names.

In [None]:
weather_observations.head()

Oops! Seems as though this file is tab separated, not comma separated. 

Fortunately, `pandas.read_csv()` has a range of keyword arguments that control how our data is read in.

- `sep`: the separator between our columns.
- `header`: the row number that holds the column names; the data starts on the row after it. Useful if you have metadata attached at the beginning of your file.
- `names`: your own column names, given as a list of values. If your file contains no header row, also use `header=None`. If it has a header row that you want to replace, use `header=0`.
- `parse_dates`: treat one or more columns as dates.
- `dayfirst`: read dates as DD/MM/YYYY rather than month first.
- `infer_datetime_format`: tell pandas to guess the date format (recent pandas versions deprecate this argument because guessing is now the default).
- `na_values`: values to be treated as empty.
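
As a rough sketch of how a few of these arguments behave, the example below reads a tiny tab-separated "file" held in memory rather than the workshop data. The column names and values are invented, and the timestamp column is built with `pd.to_datetime`, which is one possible approach rather than the one this notebook uses.

```python
import io
import pandas as pd

# A tiny tab-separated "file" held in memory; columns and values are made up
raw = (
    "Date\tTime\tTmp\tWind dir\n"
    "01/02/2013\t00:00\t15.2\tN\n"
    "01/02/2013\t00:30\t-\tNNE\n"
)

toy = pd.read_csv(
    io.StringIO(raw),   # read_csv accepts file-like objects as well as paths
    sep='\t',           # columns are separated by tabs
    na_values=['-'],    # treat a lone dash as a missing value
)

# Build a single timestamp column; dayfirst=True reads 01/02/2013 as 1 February
toy['Datetime'] = pd.to_datetime(toy['Date'] + ' ' + toy['Time'], dayfirst=True)

print(toy.dtypes)
print(toy)
```

Because the dash was declared as a null value, `Tmp` is read as a numeric column instead of text.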



In [None]:
weather_observations = pd.read_csv('Canberra_observations.csv',
                                   sep='\t',
                                   na_values=['-'],
                                   parse_dates={'Datetime': ['Date', 'Time']},
                                   dayfirst=True, #redundant in this instance
                                   infer_datetime_format=True, #redundant in this instance
                                  )
# Display some entries
weather_observations.head()

### Examining your Data

- `head()`, `tail()`
- `columns`
- `df.Column` vs `df['Column']`
- `dtypes`
- `describe()`
- `shape`; `shape[0]` vs. `shape[1]`
- `iloc` vs `loc`

One of the first steps in exploring your data is to see what it looks like, what data types are present, and how many rows/columns there are.

In [None]:
weather_observations.tail()

You can also specify how many rows you want `head` and `tail` to return

In [None]:
weather_observations.head(10)

The `shape` attribute gives you the dimensions of your data, in the form `(#rows, #columns)`.

In [None]:
# This returns a tuple (or linked pairs) of the number of rows and columns in your dataframe.
weather_observations.shape

So we can see here that we have 12 columns, and 19,918 rows of data.



In [None]:
# Calling the first or second element of the tuple can give you either the rows or the columns
#Gives you the rows (remember, 0 indexing!)
print(weather_observations.shape[0])

#Gives you the columns
print(weather_observations.shape[1])

You can also use `df.columns` to examine the column names.

In [None]:
weather_observations.columns

You can also examine the data types inside your columns using `dtypes`

In [None]:
weather_observations.dtypes

In [None]:
# What is an "object" data type??
weather_observations.head()

#### Subsetting Columns

To select a particular column in your dataframe, you can use one of two options:

1) Calling the column as an "attribute"
    - `df.columnName`

2) Subsetting the dataframe
    - `df['columnName']`

The first option is useful for some functions, but the second form is essential if you want to call more than one column at once. You do this by inserting a list, [], of column names, instead of a single column name.

For example: `df[["Column1", "Column2", ... , etc]]`
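
Here is a small sketch of the two forms side by side, using a toy dataframe whose column names and values are invented for the illustration.

```python
import pandas as pd

# Toy dataframe; the column names are invented for this illustration
toy = pd.DataFrame({
    'Tmp': [15.2, 14.8, 14.1],
    'Wind dir': ['N', 'NNE', 'S'],
    'Pres': [1012.0, 1011.5, 1011.9],
})

by_attribute = toy.Tmp            # works because 'Tmp' is a valid Python name
by_key = toy['Wind dir']          # names containing spaces need the bracket form
several = toy[['Pres', 'Tmp']]    # a list of names returns a DataFrame

print(type(by_attribute).__name__)  # Series
print(type(several).__name__)       # DataFrame
print(list(several.columns))        # columns come back in the order requested
```

A single column comes back as a Series, while a list of columns always comes back as a dataframe, even if the list has only one name in it.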

In [None]:
weather_observations.Tmp.head() #you can also "chain" commands

In [None]:
weather_observations["Wind dir"].head()

In [None]:
weather_observations.columns

In [None]:
# Try subsetting the first two columns
weather_observations[["Datetime","Wind dir"]].head()

In [None]:
# What happens if you change the order of the columns around?
weather_observations[["Wind dir","Datetime"]].head()

In [None]:
# What about a column that doesn't exist?
weather_observations.Wind

#### Slicing

Sometimes you need to examine specific rows and columns in the middle of your data though, which aren't covered by `head()` or `tail()`. Instead, you can use index slicing.

Slicing works similarly to how you might slice a string, or a list. You simply call the indexes of the rows you want from the dataframe: `df[rowNumbers]`

In [None]:
weather_observations[:3]

In [None]:
weather_observations[["Pres","rh","Fire"]][:10]

#### `loc` vs `iloc`

You can also use `loc` and `iloc` to slice rows and columns.

`df.iloc[]` is position based, so it takes integer values that correspond to the row and column **numbers** in your dataframe

`df.loc[]` is label based, and takes the row and column **names** as inputs

For this example we're going to use our toy dataframe, `df`

In [None]:
df

In [None]:
#Using iloc to get rows 0, 1 and 2
df.iloc[0:3]

In [None]:
# Using loc to get row labels 1, 2 and 3
df.loc[1:3]

If you only enter one set of values into `loc` and `iloc`, they will return the values for every column in your dataset.

By adding a second value though, you can choose which rows and columns you specifically want to subset. The first value corresponds to the rows, and the second to the columns, i.e. rows x columns. You can also remember this with the mnemonic *"Roman Catholic"* (Rows, then Columns).

In [None]:
# Using loc to get row labels 1, 2 and 3 for Columns A and B
df.loc[1:3,["A","B"]]

In [None]:
# Using iloc to get rows 0, 1 and 2 for columns 0 and 1.
df.iloc[:3,:2]

The differences between `loc` and `iloc` can seem minor, but they're very important, and which one you should use depends on what your needs are at the time.

`iloc` is based on dataframe position, so calling `iloc[:3]` would give you the rows at positions 0, 1 and 2 (the end position is excluded, as with list slicing).

`loc` however is based on the index label, so if you were to call `df.loc[:3]`, it would give you all rows up to AND INCLUDING the row labelled as index 3.

While the indexes are in order and all present, this isn't an issue. Consider what happens when the indexes are out of order though, with our dataframe 'df'

In [None]:
# Can see that this only takes the first 2 rows
df.iloc[:2]

In [None]:
# Whereas this takes all rows up to and including index label 2
df.loc[:2]

Since `loc` is based on labels, if you try to subset a row or column label that doesn't exist, even if it corresponds to a positional row, pandas will raise a `KeyError`

In [None]:
df.loc[0]

Due to this, it's important that you carefully consider which tool is appropriate for your needs
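
To recap the difference in one self-contained sketch, the example below rebuilds a toy dataframe with the same out-of-order row labels as `df` and compares the two tools. The variable names are made up for the illustration.

```python
import pandas as pd

# Toy dataframe whose row labels are deliberately out of order
toy = pd.DataFrame({'A': list('abcdef')}, index=[49, 48, 47, 1, 2, 3])

first_two = toy.iloc[:2]   # positions 0 and 1
up_to_2 = toy.loc[:2]      # every row from the top down to label 2, inclusive

print(list(first_two.index))  # [49, 48]
print(list(up_to_2.index))    # [49, 48, 47, 1, 2]

# Position 0 exists, but no row is *labelled* 0, so loc raises a KeyError
try:
    toy.loc[0]
    label_zero_found = True
except KeyError:
    label_zero_found = False

print(label_zero_found)       # False
```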

#### Challenge

Find the 11th through 20th (inclusive) rows for the columns related to wind variables using each of loc, iloc and slicing

In [None]:
#slicing


In [None]:
#loc


In [None]:
#iloc


#### Subsetting with Conditionals

You can also subset using conditional statements, such as ==, !=, >, <, etc.

For example, if I want to find all of the rows where Wind Gust > 80, I would type:

In [None]:
# All columns/rows where the wind gusts > 80
weather_observations[weather_observations["Wind gust"] > 80]

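Just as with `if` statements, you can also combine these conditionals, using `&` (and) or `|` (or). Wrap each condition in its own parentheses, because `&` and `|` are evaluated before comparisons such as `>`.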

In [None]:
#Find the rows where Wind spd > 60 AND Pres > 1000
weather_observations[(weather_observations["Wind spd"] > 60) & (weather_observations.Pres > 1000)]

In [None]:
#You can also get a list of the row indexes for your subset
list(weather_observations[(weather_observations["Wind spd"] > 60) & (weather_observations.Pres > 1000)].index)
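
The sketch below shows `&`, `|` and `~` (not) on a toy dataframe with made-up values. Storing each condition in its own variable, as done here, is just one way to keep long subsets readable.

```python
import pandas as pd

# Toy observations (values are made up)
toy = pd.DataFrame({
    'Wind spd': [5, 65, 70, 12],
    'Pres': [1021.0, 1003.0, 998.0, 995.0],
})

# Each comparison produces one True/False value per row
windy = toy['Wind spd'] > 60
high_pres = toy['Pres'] > 1000

both = toy[windy & high_pres]         # rows where both conditions hold
either = toy[windy | high_pres]       # rows where at least one holds
neither = toy[~(windy | high_pres)]   # ~ flips True and False

print(list(both.index))     # [1]
print(list(either.index))   # [0, 1, 2]
print(list(neither.index))  # [3]
```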

Similarly, you can subset using boolean lists

In [None]:
# A boolean Series: True wherever Wind spd is null, False everywhere else
na = weather_observations["Wind spd"].isnull()

weather_observations[na] # subsets all rows where Wind spd is null

#### Challenge

How many rows exist where null values are concurrently present in the columns "Wind dir", "Fire" and "Wind spd"?

_Hint: Remember that you can combine conditions with '&'_

#### Describing your data

Often you'll also want a quick summary of your data and what's in each column

You can do this with the function `describe()`

In [None]:
weather_observations.describe()

You'll notice that it's excluded the non-numeric columns though.

To get a description of all of your data at once, you need to give `describe` the `include='all'` argument

In [None]:
weather_observations.describe(include = 'all')

We can also find specific statistics for each of these columns by using `max()`, `min()`, `count()`, `std()` {standard deviation}, `mean()` and `sum()`. 

Just remember though that many of these functions rely on numeric data types, and will cause errors if used on a str type. Calling `sum()` on a string column, however, will concatenate those strings.

In [None]:
#The maximum value in Fire
weather_observations.Fire.max()

In [None]:
# The number of null values in Fire
weather_observations.Fire.isnull().sum()

In [None]:
#Try finding the maximum of the Wind dir (a string) column
weather_observations['Wind dir'].max()
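
The following sketch runs a few of these statistics on two tiny made-up Series, one numeric and one of strings, so you can compare how they behave.

```python
import pandas as pd

# One numeric and one string Series (values are made up)
numbers = pd.Series([1.5, 2.5, None])
words = pd.Series(['N', 'NE', 'S'])

print(numbers.mean())   # 2.0: nulls are skipped by default
print(numbers.count())  # 2: count() only counts non-null values
print(words.sum())      # NNES: sum() on strings joins them together
print(words.max())      # S: strings are compared alphabetically
```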

It's often useful to know exactly what and how many different values you have in categorical columns.

`unique()` works on both numeric and string data, and gives you a list of all of the unique values within a particular column.

In [None]:
weather_observations["Wind dir"].unique()

You can find out how many unique values there are in two ways:

1) Taking the `len()` of the array  
2) Using the numpy `size` attribute

In [None]:
print("Len:", len(weather_observations["Wind dir"].unique()))

print("Numpy:", weather_observations["Wind dir"].unique().size)
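
One thing to watch for is that `unique()` keeps null as one of its values. The sketch below shows this on made-up data, along with `nunique()` and `value_counts()`, two related pandas methods that this notebook does not otherwise use.

```python
import pandas as pd

# Toy column containing one null (values are made up)
directions = pd.Series(['N', 'S', 'N', None, 'E', 'S'])

print(directions.unique())        # the distinct values, null included
print(len(directions.unique()))   # 4: N, S, E and the null
print(directions.nunique())       # 3: nunique() ignores nulls by default
print(directions.value_counts())  # how many times each non-null value occurs
```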

## Cleaning and Manipulating Data

- Dealing with duplicates
- Sorting data
- using the `apply()` function
    - `reset_index` for row names
- Adding and deleting columns/rows
    - renaming columns
- setting data frequency


If you look at our data you can notice some funny things.

In [None]:
# First 5 and last 5 values around the 01/01/2013
weather_observations["Datetime"].iloc[list(range(5)) + list(range(48,53))]

Not only does it order the time series in the reverse order, but it actually includes two midnight values for the same day - one at the start and one at the end of each day

In [None]:
# Remove duplicated items with the same date and time
no_duplicates = weather_observations.drop_duplicates('Datetime', keep='last')

In [None]:
print(weather_observations.shape)
print(no_duplicates.shape)

no_duplicates.head()

We can also use `sort_values()` to sort our data in chronological order

In [None]:
# Sorting is ascending by default, or chronological order
sorted_dataframe = no_duplicates.sort_values('Datetime')

In [None]:
sorted_dataframe.head()

Similarly, while querying our data, it might be better to have our datetime as our indexes, rather than some arbitrary numbers

In [None]:
# Use `Datetime` as our DataFrame index
indexed_weather_observations = sorted_dataframe.set_index('Datetime')

In [None]:
indexed_weather_observations.head()

Otherwise, if you didn't want to index your data with another column, you can just do `df.reset_index()`
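
Here is the whole clean-up sequence as a minimal sketch on a four-row toy dataframe with made-up timestamps, including a repeated midnight reading.

```python
import pandas as pd

# Toy readings in reverse order, with midnight on 2 January recorded twice
toy = pd.DataFrame({
    'Datetime': pd.to_datetime([
        '2013-01-02 00:00', '2013-01-01 12:00',
        '2013-01-01 00:00', '2013-01-02 00:00',
    ]),
    'Tmp': [14.0, 22.5, 15.0, 14.1],
})

# Keep only the last row for each repeated timestamp
deduplicated = toy.drop_duplicates('Datetime', keep='last')

# Sort chronologically, then promote the timestamps to row labels
indexed = deduplicated.sort_values('Datetime').set_index('Datetime')
print(indexed)

# reset_index() moves the timestamps back into an ordinary column
restored = indexed.reset_index()
print(restored.columns.tolist())  # ['Datetime', 'Tmp']
```

None of these methods change `toy` itself; each one returns a new dataframe, which is why the results are assigned to new names.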

### Adding and deleting columns

#### The `apply()` function

If we wanted to model this data, we would need the `Wind dir` column to be numeric values, rather than strings.

Thankfully, we can do this using the "apply" function.

Apply will essentially "apply" a function of your choosing to the specified column/s. This could be to create a new column based off of the values of other columns, or, as in this case, processing the data within a single column to become something new.

In this particular instance, each of our wind directions corresponds to an angle:
 * North wind (↓) is 0 degrees, going clockwise ⟳.
 * East wind (←) is 90 degrees.
 * South wind (↑) is 180 degrees
 * West wind (→) is 270 degrees
 * etc

In [None]:
# Translate wind direction to degrees
wind_directions = {
     'N':   0. , 'NNE':  22.5, 'NE':  45. , 'ENE':  67.5 ,
     'E':  90. , 'ESE': 112.5, 'SE': 135. , 'SSE': 157.5 ,
     'S': 180. , 'SSW': 202.5, 'SW': 225. , 'WSW': 247.5 ,
     'W': 270. , 'WNW': 292.5, 'NW': 315. , 'NNW': 337.5 }

If we wanted to create a new column, called "Wind deg", we simply need to assign a new "series" object to a new column in our dataframe

In [None]:
# Create a new column holding the wind direction in degrees.
# Passing the dictionary's `get()` method to `apply` looks up each value safely:
# a direction that is missing from the dictionary returns None instead of an error.
indexed_weather_observations['Wind deg'] = \
    indexed_weather_observations['Wind dir'].apply(wind_directions.get)

In [None]:
indexed_weather_observations.head()
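
The sketch below repeats the idea on a four-value toy Series with a shortened lookup table, so you can see what happens to a value that is not in the dictionary. The `map` call and the lambda are additions for illustration: `map` is an alternative way of doing a dictionary lookup, not the method used in this notebook.

```python
import pandas as pd

# A shortened lookup table and a toy column; 'calm' is deliberately unmapped
degrees = {'N': 0.0, 'E': 90.0, 'S': 180.0, 'W': 270.0}
directions = pd.Series(['N', 'S', 'calm', 'W'])

# apply() calls degrees.get once per value; unknown keys come back as null
with_apply = directions.apply(degrees.get)

# map() accepts the dictionary directly and gives NaN for unknown keys
with_map = directions.map(degrees)

# apply() takes any function of one value, such as a lambda:
# here, the direction directly opposite each angle
opposite = with_map.apply(lambda angle: (angle + 180) % 360)

print(with_apply.tolist())
print(with_map.tolist())
print(opposite.tolist())
```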

Similarly, we could have just over-written the current values inside `Wind dir` instead of creating a new column.

However, now that we have a new column, we want to delete the old one. 

Just as with a list, we can do this with the command `del`

In [None]:
del indexed_weather_observations["Wind dir"]

In [None]:
indexed_weather_observations.head()

In [None]:
# If you then wanted to reorder your columns
indexed_weather_observations = indexed_weather_observations.iloc[:,[-1]+list(range(10))]

In [None]:
indexed_weather_observations.head()
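
If you would rather not count column positions, a possible alternative is to work with the column names directly. This sketch uses a made-up toy dataframe and the `drop()` method, which is not used elsewhere in this notebook.

```python
import pandas as pd

# Toy dataframe (values are made up)
toy = pd.DataFrame({
    'Wind dir': ['N', 'S'],
    'Tmp': [15.0, 14.5],
    'Wind deg': [0.0, 180.0],
})

# drop() returns a new dataframe and leaves the original untouched,
# whereas `del toy['Wind dir']` would modify toy itself
trimmed = toy.drop(columns=['Wind dir'])

# Reorder by listing the column names in the order you want
reordered = trimmed[['Wind deg', 'Tmp']]

print(toy.columns.tolist())        # ['Wind dir', 'Tmp', 'Wind deg']
print(reordered.columns.tolist())  # ['Wind deg', 'Tmp']
```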

#### Renaming columns
If we wanted, we could also rename our `Wind deg` column to be `Wind dir` again

In [None]:
indexed_weather_observations =\
            indexed_weather_observations.rename(columns = {'Wind deg':'Wind dir'})

indexed_weather_observations.head()

#### Challenge

Create a new column "E" for our toy dataframe, df. "E" will contain the values from column B + 100. 

_Hint 1: You can test your output by using the "apply" function without assigning it to a new column first_

_Hint 2: you can create your own functions using "lambda", e.g._ `apply(lambda x: <insert stuff with x>)`


### Changing Data Frequency

If you delve into your data, you might occasionally see some timestamps that don't follow the typical half-hour format

In [None]:
# One section where the data has weird timestamps ...
indexed_weather_observations[1800:1806]

Another function, `df.asfreq()`, allows you to force a frequency on your index, and will discard/fill in the rest

In [None]:
# Force the index to be every 30 minutes
regular_observations = indexed_weather_observations.asfreq('30min')

print(regular_observations.shape) # we've now deleted ~ 2000 rows

# Same section we observed earlier
regular_observations[1633:1638]
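
To make the "discard/fill in" behaviour concrete, here is a minimal sketch on four made-up timestamps: one reading sits off the half-hour grid, and the 01:00 reading is missing altogether.

```python
import pandas as pd

# Toy readings: 00:47 is off the half-hour grid, and 01:00 is missing
timestamps = pd.to_datetime([
    '2013-01-01 00:00', '2013-01-01 00:30',
    '2013-01-01 00:47', '2013-01-01 01:30',
])
toy = pd.DataFrame({'Tmp': [15.0, 14.5, 14.2, 13.8]}, index=timestamps)

# Keep only timestamps on the 30 minute grid, adding empty rows where needed
regular = toy.asfreq('30min')
print(regular)
```

The 00:47 reading is dropped, and a new 01:00 row appears with a null temperature, so the result still has four rows.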

### Dealing with Missing Data

- finding missing data
- interpolating missing data
- data interrogation with the `groupby` function

Often we want to know where and if we have any missing data in our dataset

A rudimentary way of doing this is with `count()`. `count()` will give you the number of **non-null** values within each column. Therefore, if a column's count is smaller than the number of rows, that column contains nulls

In [None]:
regular_observations.count()

So we can see here, none of our columns contain all 17520 values, with the most significant number of nulls being found in "Wind dir"

As seen when applying conditional statements before, there is also the `isnull()` function, which returns a list of True/False values. This can be used to subset your data to find the null rows, or to count how many there are in specific columns

In [None]:
regular_observations["Wind dir"].isnull().sum()

You can also show *every* row with null values at once using the isnull().any() chained operation. any() contains the optional parameter "axis", which allows you to choose whether it operates on the rows or the columns. By default axis = 0, which gives you columns, but setting axis = 1 checks over the rows instead

In [None]:
# Will give True/False values for whether each row contains nulls (add .sum() to get the total number of such rows)
regular_observations.isnull().any(axis = 1)

In [None]:
# Will give True/False values for whether a column contains nulls
regular_observations.isnull().any(axis = 0)
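
Here are the same checks on a toy dataframe with made-up values, small enough that you can confirm each number by eye.

```python
import pandas as pd

# Toy dataframe with a few nulls (values are made up)
toy = pd.DataFrame({
    'Tmp': [15.0, None, 14.1, 13.9],
    'Wind spd': [10.0, None, None, 12.0],
    'Pres': [1012.0, 1011.0, 1010.5, 1010.0],
})

print(toy.count())                     # non-null values per column: 3, 2, 4
print(toy.isnull().sum())              # nulls per column: 1, 2, 0
print(toy.isnull().any(axis=0))        # does each column contain a null?
print(toy.isnull().any(axis=1).sum())  # 2 rows contain at least one null
```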

Similarly, if we plot our data we can see that we have a significant portion of data missing

In [None]:
# Make the graphs a bit prettier
# pd.set_option('display.mpl_style', 'default') 
# plt.rcParams['figure.figsize'] = (18, 5)

regular_observations[['Wind spd', 'Wind gust', 'Tmp', 'Feels like']][:500].plot()

As we were seeing while tracking the null values, we can see that there are a number of values missing even just in January

This can be due to errors in the sensors, or because of maintenance, etc. However if we wanted to model all of this data, this data needs to be complete

#### Interpolating Missing Data

One option is to use the function `Series.interpolate` to predict and fill in the missing values based on the index

If we examine the function more closely, we can see that it can take multiple arguments, including the prediction method (default = linear), and the direction in which it can replace your data

In [None]:
help(df.interpolate)

In [None]:
# Some of the null values    
regular_observations[1633:1638]

In [None]:
# Interpolate data to fill empty values
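for column in regular_observations.columns:
    # Assign the result back, rather than relying on inplace=True on a selected column
    regular_observations[column] = regular_observations[column].interpolate(method='time')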

In [None]:
# The null values have been replaced    
regular_observations[1633:1638]

In [None]:
# Plot it again - to be sure
regular_observations[['Wind spd', 'Wind gust', 'Tmp', 'Feels like']][:500].plot()

No gaps!
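
The choice of method matters when your timestamps are unevenly spaced. This sketch compares the default `'linear'` method with `'time'` on three made-up readings, where the gap after the missing value is three times longer than the gap before it.

```python
import pandas as pd

# Toy readings: the missing value sits 30 minutes into a 2 hour gap
timestamps = pd.to_datetime([
    '2013-01-01 00:00', '2013-01-01 00:30', '2013-01-01 02:00',
])
tmp = pd.Series([10.0, None, 14.0], index=timestamps)

# 'linear' ignores the index and treats the rows as equally spaced
by_position = tmp.interpolate(method='linear')

# 'time' weights the estimate by how much time has actually passed
by_time = tmp.interpolate(method='time')

print(by_position.iloc[1])  # 12.0, halfway between 10 and 14
print(by_time.iloc[1])      # 11.0, a quarter of the way from 10 to 14
```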

If you didn't want to predict values, there are a few other options you have to resolve nulls.

One option, provided you don't require a consecutive sequence of data, is to delete your null observations.

This can be easily done using the `dropna()` function

`dropna()` can work on either the rows or the columns, and allows you to specify whether you want to remove rows/column where either 'any' or 'all' of the data are nulls. Alternatively, you can set a threshold value, where rows/columns are discarded if they don't have at least a threshold # of non-null values

In [None]:
temp_df = indexed_weather_observations.dropna(axis = 0, thresh= 5)

In [None]:
print(indexed_weather_observations.shape)

temp_df.shape #As we can see, only 11 rows had < 5 values in their row
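
The three ways of choosing which rows to drop can be compared on a toy dataframe. The values below are made up so that each option keeps a different set of rows.

```python
import pandas as pd

# Toy dataframe: the rows have 3, 2, 0 and 1 non-null values respectively
toy = pd.DataFrame({
    'Tmp': [15.0, None, None, None],
    'Wind spd': [10.0, 8.0, None, None],
    'Pres': [1012.0, 1011.0, None, 1010.0],
})

no_nulls = toy.dropna(how='any')     # drop rows containing any null
not_empty = toy.dropna(how='all')    # drop rows that are entirely null
at_least_two = toy.dropna(thresh=2)  # keep rows with 2 or more non-null values

print(list(no_nulls.index))      # [0]
print(list(not_empty.index))     # [0, 1, 3]
print(list(at_least_two.index))  # [0, 1]
```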

Otherwise, we could replace all the null values in each column with an appropriate value. While this wouldn't be appropriate for this dataset, there are many where a default value would be appropriate. You can do this with a method called `fillna()`

In [None]:
df

In [None]:
df["E"] = df["E"].fillna(value = 0)

df

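You can also forward fill or back fill nulls from the neighbouring rows, using `ffill()` and `bfill()`. Older code does the same thing by passing `method='ffill'` or `method='bfill'` to `fillna`, which recent pandas versions deprecate.

In [None]:
df["B"] = df["B"].ffill()
df

In [None]:
df["D"] = df["D"].bfill()
df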

### Grouping Data

It's often useful to be able to group the data within a column according to the data in another column. 

For example, you might wish to take the mean temperature for each month, or each year. We could do this in a complicated and time consuming way where you subset the data for each month, and then take the mean of the Tmp column...or we could just use the `groupby` function. `groupby` outputs a reformatted version of your data where all values associated with your "grouping factor" are taken together. You can then perform basic statistical tests (or plot) on this grouped output, and the tests will be performed within each of the "groups" you defined.
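
Before using it on the weather data, here is a minimal sketch of `groupby` on a toy dataframe. The `Station` column and all of the values are invented for the illustration, and `idxmax()` is a pandas method that this notebook does not otherwise use.

```python
import pandas as pd

# Toy readings from two hypothetical stations over two days
timestamps = pd.to_datetime([
    '2013-01-01 06:00', '2013-01-01 18:00',
    '2013-01-02 06:00', '2013-01-02 18:00',
])
toy = pd.DataFrame({
    'Station': ['A', 'B', 'A', 'B'],
    'Tmp': [12.0, 20.0, 14.0, 24.0],
}, index=timestamps)

# Group by the values in an ordinary column
by_station = toy.groupby('Station')['Tmp'].mean()
print(by_station)       # A: 13.0, B: 22.0

# Group a datetime index into calendar days with pd.Grouper
by_day = toy['Tmp'].groupby(pd.Grouper(freq='D')).mean()
print(by_day)           # 1 January: 16.0, 2 January: 19.0

# idxmax() reports which group holds the largest value
print(by_day.idxmax())  # 2013-01-02 00:00:00
```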

Going back to the previous example, say that we wanted to find the mean temperature seen in each month:

In [None]:
# we use pd.Grouper with a frequency because we're grouping based on datetime data
# (older pandas versions called this pd.TimeGrouper, which has since been removed)
regular_observations[["Tmp","Feels like"]].groupby(pd.Grouper(freq='M')).mean()

You can also choose to group over multiple factors, such as by Month and Year together, by specifying a list for the `by` parameter within `groupby()`. The order of the values in this list matters, as `groupby` will first group your data based on the first factor, and THEN the second factor.

In [None]:
# Grouping by month, then by day
# The average wind speed and wind gust
regular_observations[["Wind spd", "Wind gust"]].groupby(
    by=[pd.Grouper(freq='M'), pd.Grouper(freq='D')]).mean()

You can also chain together more commands, like max(), etc, to find more specific data

In [None]:
# The maximum average monthly temperature 
regular_observations['Tmp'].groupby(by=pd.Grouper(freq="M")).mean().max()

In [None]:
# If I wanted to find which month that belonged to though....
monthly_mean = regular_observations['Tmp'].groupby(by=pd.Grouper(freq="M")).mean()
max_temp = monthly_mean.max()

monthly_mean[monthly_mean == max_temp]

#### Challenge

Which month has the greatest standard deviation in temperature?

## Plotting Data

Pandas has some in-built plotting functionality, that builds off of the traditional matplotlib library

We've already seen a bit of this with "plot"

In [None]:
regular_observations[['Wind spd', 'Wind gust', 'Tmp', 'Feels like']][:500].plot()

In [None]:
regular_observations[:500].plot(x = "Feels like", y = "Tmp")

In [None]:
from pandas.plotting import scatter_matrix

In [None]:
scatter_matrix(regular_observations[:500], alpha=0.2, diagonal='kde')

In [None]:
from pandas.plotting import andrews_curves

plt.figure()
andrews_curves(regular_observations[:500], 'Wind dir')

In [None]:
from pandas.plotting import parallel_coordinates

plt.figure()

# The second argument must name a column that exists in the dataframe
parallel_coordinates(regular_observations[:500], 'Wind dir')

You can see examples of these and many more inside the pandas documentation, https://pandas.pydata.org/pandas-docs/stable/visualization.html

## Combining Datasets

We also have another dataset, called "Canberra Sky". 

#### Challenge

Read in the Canberra Sky file, as `sky_observations`, in an appropriate format

Looking at the "head" of the data, it seems to have the same issues that our previous dataset originally had.

In [None]:
# As before, remove duplicates and set index to datetime.
sky_observations.drop_duplicates('Datetime', keep='last', inplace=True)
sky_observations.sort_values('Datetime', inplace=True)
sky_observations.set_index('Datetime', inplace=True)

sky_observations.head()

#### Challenge

Remove all of the rows in `sky_observations` that have no data in them. i.e. all observations for that timestamp are NaN

_Hint: remember the `dropna` function?_

Ok, now let's explore our data!

In [None]:
# Display the inferred data types
sky_observations.dtypes

A closer look at Cloud shows that, rather than restricting the column values to using a scale between 1 - 8, people have been typing in a fractional string 

In [None]:
# Display the unique values in the Cloud column
sky_observations.Cloud.unique()

In the Cloud column, 'clear' means no clouds, and 'nan' represents "obscured", meaning that the visibility was too low to see cloud. In this case, we would take that as a null value

To convert this column into a numerical value, we can define our own function and then use `apply` just as we did before

In [None]:
# Define a function to change the 'Cloud' column to numerical values
def cloud_to_numeric(s):
    # NOTE: as written, obscured (null) readings are mapped to 0, the same as 'clear'
    if s == 'clear' or pd.isnull(s):
        return 0
    else:
        # s[0] is the first str character, or the numeric value
        return int(s[0]) / 8.0 # because we want a scale between 0 and 1

In [None]:
# Apply the function to every item and assign it back to the original dataframe
sky_observations['Cloud'] = \
    sky_observations['Cloud'].apply(cloud_to_numeric).astype('float64')
sky_observations.head()

We now have a cleaned dataset for `sky_observations` that only contains numeric values.

If we wanted to combine this with our previous `regular_observations` dataset though, we can see that despite `sky_observations` being indexed over the same time period as `regular_observations`, we had to remove many more of the null values, and it now has far fewer rows.



In [None]:
print(sky_observations.shape)

print(regular_observations.shape)

#### The `combine_first` function
Therefore, if we wanted to add these two datasets together, we couldn't use the simple column assignment that we learned earlier.

To combine these two datasets, we can instead use the function `df1.combine_first(df2)`, which lines the two dataframes up on their indexes and fills the gaps in `df1` with values from `df2`.

In this particular instance, we only want to join the Cloud column to our `regular_observations` dataset

In [None]:
# Join the two observations together
combined_observations = regular_observations.combine_first(sky_observations[['Cloud']])
combined_observations.head()
# combined_observations[combined_observations.Cloud.notnull()][:10]

We have a lot of null values, but it joined it all together
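
To see where those null values come from, here is a small sketch that joins two toy dataframes with `combine_first`. The names `weather` and `sky` and all of the values are made up, and `sky` is missing one of the three timestamps.

```python
import pandas as pd

# Three half-hourly timestamps; the toy sky data only covers two of them
times = pd.to_datetime([
    '2013-01-01 00:00', '2013-01-01 00:30', '2013-01-01 01:00',
])
weather = pd.DataFrame({'Tmp': [15.0, 14.5, 13.8]}, index=times)
sky = pd.DataFrame({'Cloud': [0.5, 1.0]}, index=times[[0, 2]])

# Rows are matched on their index labels; unmatched cells are left as null
combined = weather.combine_first(sky)
print(combined)
```

The 00:30 row has no matching sky observation, so its `Cloud` value is null, while `Tmp` is carried over unchanged.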

#### Challenge

Try joining another column from `sky_observations` onto the `combined_observations` dataframe. Which column/s would be the most appropriate? Which would be the least? Why?

## Summary

This session has taken you through the beginner's guide of how to visualise, clean and investigate your data, and given you the basic tools to query your own research data. 

Python is a very versatile tool, and can go much further than what we've shown you today. Many packages are open source and are being developed for a range of purposes all the time. SciPy, for example, allows you to perform basic maths and statistical tests, and there are a host of packages related to investigating biological data.

Using a programming language allows you to investigate much larger data than you can do by hand or by eye, and the customisability of the plotting tools makes it well suited to generating useful and professional images for publication.

Follow us on Twitter to keep up-to-date on new and up-coming trainings
- [@kflekac](https://twitter.com/kflekac)
- [@ResBaz](https://twitter.com/resbaz)
- [@ResPlat](https://twitter.com/resplat)

There's also a reasonably new facebook group, called ["Data Wrangling with Python"](https://www.facebook.com/groups/797677037064561/) where you can post your python problems, cool things you've done, and stay apprised of new trainings coming up as well.

Also remember that if you're having problems, Research Platforms runs a weekly Hacky Hour, where anybody and everybody is able to enquire about coding and programming problems in a variety of tools. Every Thursday from 3-4pm, at the large table in Tsubu Bar.