# Module 3: Loading & Manipulating Data

In this module you'll learn about:

- Importing libraries
- File paths
- Opening text files from your computer
- Loading CSV files and iterating over rows
- Loading CSV into a Pandas DataFrame
- Exploring a DataFrame (including some descriptive statistics)
- Filters / subsets of DataFrames
- Editing DataFrames
- Saving DataFrames as CSV files


## Libraries

Ready for some great Python news? You don't have to code everything by yourself from scratch! Many other people have written Python code that you can import into your own code, which will save you time and do a lot of work behind-the-scenes. We call the code written and packaged up by other people a "library", "package", or "module", although here we will just refer to them as libraries. 

The `import` statement is used whenever you want to import an external Python library that was written by someone else. Let's give that a go with the `os` library, which stands for "Operating System", and allows us to interact with other files on your computer.

In [None]:
# import the os library
import os

Now that we have `os` imported, we can use it. To call any of the functions within a library, you just type the library name followed by a full stop and the function name. The `os` library has a function called `getcwd()` which returns the current working directory (cwd), which is the place your current script is running. 

In [None]:
# get the cwd and print it
cwd = os.getcwd()
print(cwd)

That should show you the *path* where you've saved your scripts. 

## Paths

Just a few notes on paths, as they will be useful when we start reading files from your computer. A path is the formal description of the location of a file or folder on your hard drive. It shows which drive the file or folder is stored on, together with the folders (also called directories) you need to traverse to get to it.

If you are using Windows, the above cell will output something like: "C:\Users\your_username\Documents\Scripts", where "C:\" is the hard drive it's stored on, and "Users\your_username\Documents\Scripts" is the location within that hard drive. Note that the path uses backward slashes (`\`).

If you are running macOS or Linux, the above cell will output something like: "/home/your_username/Scripts" (on macOS the path typically starts with "/Users" instead of "/home"). Note that the path uses forward slashes (`/`). 

This whole path from the root folder to the folder where the file is kept, is called an *absolute path*. It is always the same, no matter where your script is. The other way we can find a file is via a *relative path*. This is relative to where your script is, and so if you want to open a file that is in the same folder as your script, you don't need to type the whole path, but can just use "test.txt". As Python 'knows' where it is, it knows where to find the test.txt file. 

A relative path can also point into a subfolder of the current folder, in which case it looks something like "data/test.txt". A special syntax (`../`) is used for going up a folder, like using the back button in your file explorer: "../test.txt" will fetch the test.txt file in the folder above your script.

You don't need to remember all this right now, but it's useful to know of these concepts when working with your own data.
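To make the difference between relative and absolute paths a bit more concrete, here is a small sketch using the `os` library. The folder and file names are made up for illustration, and the file does not need to exist for this to run:

```python
import os

# Hypothetical folder and file names, used only for illustration
relative_path = os.path.join("data", "test.txt")   # a relative path
absolute_path = os.path.abspath(relative_path)     # the same location, written as an absolute path

print(relative_path)
print(absolute_path)

# os.path.isabs() tells us whether a path is absolute
print(os.path.isabs(relative_path))   # False
print(os.path.isabs(absolute_path))   # True
```

`os.path.join()` puts the right kind of slash between the parts for your operating system, so the same code works on Windows, macOS and Linux.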


## Opening text files

Quite often, research data is in the form of a text file (or can easily be converted to one). Text files are not only files with the extension ".txt", but basically any file that can be opened and read in Notepad (or TextEdit on Mac). This includes CSV (Comma Separated Values) files, which we'll come back to in the next section. 

To open a file in Python, we use the `open()` function, which requires some specific syntax. 

In [None]:
# open the text.txt file from the data folder, save the file under variable name 'file'
with open('../data/text.txt') as file:
    
    # read the text within the file, and save it in the variable 'text'
    text = file.read()

# print the text
print(text)

Did you see that in the `open()` function above, we used `../` to go up a folder, then into the data folder, and then to the text.txt file? This is a relative path!

So above we opened a text file and read all of its text as one string, which we saved in the `text` variable. This can be useful, but quite often with research data, you will find one entry, or one thing, per line. So it would be more useful to have each line as an item in a list. To do this, we still read the text with `read()`, and then split it into lines with the string method `splitlines()`, like so:



In [None]:
# open the text.txt file from the data folder, save the file under variable name 'file'
with open('../data/text.txt') as file:
    
    # read the text within the file, and save it in the variable 'text'
    text = file.read()
    
    # take the string in `text`, and split the lines, converting it to a list where every line is 1 item
    lines = text.splitlines()

# print it!
print(lines)

Now we have each line as an item in a list, we can loop over it using a `for` loop, like we did in the previous module. Before you run the cell below, read the code, and try to predict what it will do:

In [None]:
for line in lines:
    print(len(line))

Did that match your expectation?
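Python file objects also have a `readlines()` method, which returns a list of lines directly. The sketch below compares it with `read().splitlines()`. To keep it runnable anywhere, it first writes a small throwaway file; the file name and its contents are made up for illustration:

```python
import os
import tempfile

# Create a small throwaway text file so the example runs anywhere
folder = tempfile.mkdtemp()
path = os.path.join(folder, "example.txt")
with open(path, "w") as file:
    file.write("flint\nbone\nceramic\n")

# Option 1: read everything as one string, then split it into lines
with open(path) as file:
    split_lines = file.read().splitlines()

# Option 2: let readlines() build the list of lines
with open(path) as file:
    raw_lines = file.readlines()

print(split_lines)  # ['flint', 'bone', 'ceramic']
print(raw_lines)    # ['flint\n', 'bone\n', 'ceramic\n']
```

Note that `readlines()` keeps the newline character (`\n`) at the end of each line, which would change the result of `len(line)`. That is one reason to prefer `splitlines()` here.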

## CSV files

CSV (short for Comma Separated Values, .csv) is a specific format for text files. Like the name suggests, it uses Commas to Separate Values, and this way we can store spreadsheet-like information. This format is quite simple, which means it can easily be read by Excel, but also by Python. It is often used as a way to transfer information between software packages, and also for long-term storage. 

A CSV file is just a text file, so if you open one in Notepad, it looks something like this:

    width,height,weight
    2,3,5
    5,9,8

Here we see a spreadsheet with 3 columns (width, height, weight), and 2 rows containing values. Each column is separated by a comma. As a .csv file is really just a text file, we can open it the same way as we did the text file above:

In [None]:
# open the test.csv file from the data folder, save the file under variable name 'file'
with open('../data/test.csv') as file:
    
    # read the text within the file, and save it in the variable 'text'
    text = file.read()
    
    # take the string in `text`, and split the lines, converting it to a list where every line is 1 item
    lines = text.splitlines()

# print it!
print(lines)

This works, but isn't super handy, because we now need to convert each line from a string (e.g. `'2,3,5'`) to a list (e.g. `[2, 3, 5]`). Instead, we can load and use the `csv` library, which has functions built in for reading CSV files:

In [None]:
# import csv library
import csv

# open the test.csv file 
with open('../data/test.csv') as file:
    
    # read the file with csv.reader, which gives us the rows one at a time (each row is a list of strings)
    rows = csv.reader(file)
    
    # for each row
    for row in rows:
        
        # print the row
        print(row)


Now say we want to calculate the width/height ratio for all our rows, and save these in a new list. This can be done like this:

In [None]:
# set up empty list to hold all the ratios
ratios = []

# open the test.csv file 
with open('../data/test.csv') as file:
    
    # read the file with csv.reader, which gives us the rows one at a time (each row is a list of strings)
    rows = csv.reader(file)
    
    # for each row (the file is read again here, so `rows` starts from the first line)
    for row in rows:

        # we don't want to do anything in the first row, as this contains the headers. 
        # we skip it by checking if the first column of the row is not 'width' ("!=" means 'is not')
        if row[0] != 'width':

            # calculate the ratio (0 = width, 1 = height)
            ratio = float(row[0]) / float(row[1])

            # add ratio to the list
            ratios.append(ratio)
        
# print ratios
print(ratios)
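Another common way to skip the header row is to take it off the reader with Python's built-in `next()` function before the loop starts. The sketch below uses the three example lines shown earlier, held in memory with `io.StringIO` instead of a file on disk, so it runs without the data folder:

```python
import csv
import io

# The same three lines as the small CSV example, held in memory
csv_text = "width,height,weight\n2,3,5\n5,9,8\n"

ratios = []
rows = csv.reader(io.StringIO(csv_text))

# take the first row (the headers) off the reader before looping
header = next(rows)

for row in rows:
    # 0 = width, 1 = height
    ratios.append(float(row[0]) / float(row[1]))

print(header)
print(ratios)
```

This avoids checking the first column in every pass of the loop, and it still works if the header names change.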

The code above works, but it's a lot of code for something relatively simple, and you have to do a lot of stuff manually (loading the file, looping through rows, skipping headers). If your data looks like a spreadsheet, it is generally a lot easier to import the data into a so-called DataFrame. 

## Pandas DataFrames

A DataFrame is a 2-dimensional data structure. It is a type of variable, just like lists and strings. 2-dimensional might sound complex, but really it just means that it contains data in rows and columns, just like your average spreadsheet. DataFrames can also do just about anything you can do in Excel, plus some extra stuff! A lot of our archaeological data is in this format, so if you start coding your own projects, it is likely that DataFrames will make your life easier. 

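DataFrames are not built into Python itself, so we need to import a library first, just like we did with `os` and `csv`. The library that contains DataFrames is called `pandas`. Unlike `os` and `csv`, which come with every Python installation, `pandas` has to be installed separately (distributions such as Anaconda include it). Let's import it now: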

In [None]:
import pandas as pd

The above `import` statement not only imports the Pandas library but also gives it an alias or nickname: `pd`. This alias saves us from having to type out the entire word `pandas` each time we need to use it. Many Python libraries have commonly used aliases like `pd`. 

### Read in CSV File

To read in a CSV file, we will use the function `pd.read_csv()` and insert the name of our desired file path.

In [None]:
test_df = pd.read_csv('../data/test.csv', delimiter=',')

This creates a Pandas DataFrame object — often abbreviated as df, e.g., `test_df`. A DataFrame looks and acts a lot like a spreadsheet. But it has special powers and functions that we will discuss below.

When reading in the CSV file, we also specified the `delimiter`. The delimiter is the character that separates or "delimits" the columns in our dataset. For CSV files, the delimiter will most often be a comma. (CSV is short for Comma Separated Values.) Sometimes, however, the delimiter of a CSV file might be a tab (`\t`) or, more rarely, another character. Always inspect your data before loading to see which delimiter is used!
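As a small sketch of what a different delimiter looks like in practice, the cell below reads tab-separated text. The text is made up and held in memory with `io.StringIO`, standing in for a file on disk:

```python
import io
import pandas as pd

# Made-up tab-separated text, standing in for a file on disk
tsv_text = "width\theight\tweight\n2\t3\t5\n5\t9\t8\n"

# tell pandas that the columns are separated by tabs
tab_df = pd.read_csv(io.StringIO(tsv_text), delimiter="\t")

print(tab_df)
print(list(tab_df.columns))
```

If you load such a file with the wrong delimiter, you will usually end up with everything squeezed into a single column, which is a good hint to check the delimiter.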

### Display a DataFrame

To see what data is stored in a DataFrame, you can use the print() function, like with any other variable:

In [None]:
print(test_df)

You can see the tabular structure of the data very well, and this is much easier to read than when we loaded the CSV manually. 

An even better way to view DataFrames is by simply typing the variable name into a cell. Jupyter Notebook will recognise the variable as a DataFrame, and apply some styling to make the data even easier to read:

In [None]:
test_df

For such a small table this doesn't matter that much, so let's load a bigger spreadsheet:

In [None]:
# load artefacts file
artefacts = pd.read_csv('../data/artefacts.csv', delimiter=',')

# show data
artefacts

You might recognise this data, it's also used in the BA1 Data Analysis course. We will repeat some of the analyses and visualisations from that course, to show you how easy and flexible it is to do it in Python instead of Excel!

This spreadsheet contains data from a field survey in the Italian research area Agro Pontino, just south of Rome. Each row is an observation (record) and describes a single artefact as found during the field survey. Artefacts that were discovered during a visit were numbered sequentially. Normally several artefacts are found on an arable field, and several adjacent fields might together form one archaeological site. An arable field is often revisited several times to collect more artefacts in order to get a better picture of the site's size, function and date.

Most of the variables used are categorical data (measurement level) with coded (numerical) values. These categorical data are particularly suitable for data analysis.  

There are a few important things to note about the DataFrame displayed above:

- Index
    - The bolded ascending numbers in the very left-hand column of the DataFrame are called the Pandas Index. You can select rows based on the Index.
    - This is similar to the row numbers in Excel
    - By default, the Index is a sequence of numbers starting with zero. However, you can change the Index to something else, such as one of the columns in your dataset.

- Rows x Columns

    - Pandas reports how many rows and columns are in this dataset at the bottom of the output (606 rows x 12 columns).
    - This is very useful!

- Truncation

    - The DataFrame is truncated, signaled by the ellipses (...) in the middle of the rows.
    - The DataFrame is truncated because Pandas limits how many rows it displays. By default, a DataFrame with more than 60 rows (the `display.max_rows` setting) is shortened to its first and last 5 rows. To display all the rows, we can alter the setting:


In [None]:
# set max rows to display to 1000, more than the total of rows in the dataframe
pd.set_option('display.max_rows', 1000)

# show data again
artefacts
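If you only want to know how many rows and columns a DataFrame has, you don't need to display it at all. A minimal sketch, using a tiny made-up DataFrame rather than the artefacts data:

```python
import pandas as pd

# A tiny made-up DataFrame for illustration
small_df = pd.DataFrame({"width": [2, 5, 7], "height": [3, 9, 4]})

# .shape gives (number of rows, number of columns)
print(small_df.shape)

# len() gives the number of rows only
print(len(small_df))

# get_option() shows the current value of a display setting
print(pd.get_option("display.max_rows"))
```

Note that `.shape` is written without brackets: it is a property of the DataFrame, not a function.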

### Display Sections of a DataFrame

To look at the first *n* rows in a DataFrame, we can use a method called `.head()`. 

Note, *n* is used in math/programming to denote any arbitrary number. In the below cell, *n* = 20 for example.

In [None]:
# show top 20 rows
artefacts.head(20)

To look at the last *n* rows in a DataFrame, we can use a method called `.tail()`.

In [None]:
# show bottom 40 rows
artefacts.tail(40)

If you're more interested in seeing some rows throughout your data, you can get a random sample by using `.sample()`:

In [None]:
artefacts.sample(15)

You can tell this is a random sample by looking at the index numbers on the left: they are not sequential or sorted anymore! Run the above cell again to see the rows change. Every time you run `.sample()` you get a different, random sample.
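Sometimes you want a random sample that stays the same each time you run the cell, for example so that a colleague sees the same rows as you. The `random_state` option of `.sample()` can be used for that. A small sketch with a made-up DataFrame of 100 numbers:

```python
import pandas as pd

# A made-up DataFrame with the numbers 0 to 99
numbers = pd.DataFrame({"value": range(100)})

# using the same random_state gives the same "random" rows every time
first = numbers.sample(5, random_state=1)
second = numbers.sample(5, random_state=1)

print(first.index.tolist())
print(second.index.tolist())
```

The number given to `random_state` is arbitrary; what matters is that it is the same in both calls.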

### Get Info and Statistics

To get useful info about all the columns in the DataFrame, we can use `.info()`.

In [None]:
artefacts.info()

This report tells us how many non-null, or non-blank, values are in each column, as well as what type of data is in each column. Unfortunately, Pandas has slightly different names for variable types, but here's a translation table:

| Pandas Data Type | Explanation |
|:----------------:|:-----------:|
|      object      |    string   |
|      float64     |    float    |
|       int64      |   integer   |
|    datetime64    |  date time  |
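To check the data types of a DataFrame without the full `.info()` report, you can look at its `.dtypes` property. A small sketch with made-up values:

```python
import pandas as pd

# Made-up values, one column per kind of data
finds = pd.DataFrame({
    "site": ["A", "B", "C"],        # text
    "weight": [12.5, 8.0, 20.1],    # numbers with decimals
    "pieces": [3, 1, 7],            # whole numbers
})

# show the data type of every column
print(finds.dtypes)
```

The `weight` column is reported as `float64` and `pieces` as `int64`. The text column shows up as `object` in many Pandas versions, although newer releases may report a dedicated string type instead.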

To calculate descriptive (summary) statistics for every column in our DataFrame, we can use the `.describe()` method.

In [None]:
artefacts.describe()

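That's certainly a lot easier and quicker than calculating all these values by using Excel formulas! However, these columns mainly contain nominal and ordinal data (categories and counting numbers), so not every statistic here is meaningful: the mean of a coded category, for example, tells us very little. 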

#### Exercise time!

In the cells below, do the following:

- Load the 'spearheads.csv' file from the data folder into a DataFrame, name the variable `spearheads`
- Show and inspect the resulting DataFrame
- Question: how many rows are in the DataFrame? (Take into account the index starts at 0!)

In [None]:
## EXERCISE ##



Part 2 of the exercise:

- Calculate and show the descriptive statistics for `spearheads`
- Question: does the 'count' match your answer above?
- Question: what is the maximum value for 'weight'?
- Question: what is the average date?

In [None]:
## EXERCISE ##



### Selecting Columns

To select a column from the DataFrame, we type the name of the DataFrame followed by square brackets and a column name in quotation marks, just like we do when selecting a value from a dictionary!

In [None]:
# select and show column 'con'
spearheads['con']

Technically, a single column in a DataFrame is a Series object. We can ask Python to tell us the type of a variable by using the `type()` function:

In [None]:
type(spearheads['con'])

Mini-exercise! Check the type of the following variables. Before running the `type()` function, try and predict what type each variable is. Did your expectation match the output?

In [None]:
# define variables
site = 'Ur'
weight = 123.5
number_of_sites = 5
types = ['flint','bone']
indy = {'name': 'Indiana Jones'}
gold_found = True

# check the type of these variables below


A Series object displays differently than a DataFrame object. To select a column as a DataFrame and not as a Series object, we use double square brackets: the inner pair makes a list of column names, and the outer pair does the selecting.

In [None]:
spearheads[['con']]

By using two square brackets, we can also select multiple columns at the same time.

In [None]:
spearheads[['num', 'mat', 'con']]
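The difference between single and double square brackets can be checked with `type()`. The sketch below uses a tiny made-up DataFrame that borrows two column names from the spearheads data; the values are invented:

```python
import pandas as pd

# Tiny made-up DataFrame (values are invented for illustration)
toy = pd.DataFrame({"mat": [1, 2, 1], "weight": [100.0, 150.0, 120.0]})

single = toy["weight"]      # single brackets: a Series
double = toy[["weight"]]    # double brackets: a DataFrame with 1 column

print(type(single))
print(type(double))
print(double.shape)         # (3, 1): 3 rows, 1 column
```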

Once you've defined a subset of data, you can get the descriptive statistics for just those columns, again using `describe()`

In [None]:
spearheads[['weight', 'maxle', 'maxwi']].describe()

It is also possible to calculate specific statistics for just one column:

In [None]:
spearheads["maxle"].mean()

Or for a couple of columns (note the double square brackets!):

In [None]:
spearheads[["maxle", "maxwi"]].median()

Note: we've not used `print()` in the last couple of cells, as Jupyter Notebook automatically displays the result of the last line of code in a cell. However, it only does this *once per cell*. So in the following cell, you'll only see the max, not the min.

In [None]:
spearheads["weight"].min()
spearheads["weight"].max()

To print multiple things in one cell, you will still need to use `print()`

In [None]:
print(spearheads["weight"].min())
print(spearheads["weight"].max())

For reference, here's a table with commonly used calculations that are built into Pandas.

|  Function |            Description           |
|:---------:|:--------------------------------:|
|  count()  | Number of non-null observations  |
|   sum()   | Sum of values                    |
|   mean()  | Mean of Values                   |
|  median() | Median of Values                 |
|   mode()  | Mode of values                   |
|   std()   | Standard Deviation of the Values |
|   min()   | Minimum Value                    |
|   max()   | Maximum Value                    |
|   abs()   | Absolute Value                   |
|   prod()  | Product of Values                |
|  cumsum() | Cumulative Sum                   |
| cumprod() | Cumulative Product               |
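If you want several of these statistics at once, one option is the `.agg()` method, which accepts a list of function names. A minimal sketch on a made-up Series of weights:

```python
import pandas as pd

# Made-up weights for illustration
weights = pd.Series([100.0, 150.0, 120.0, 150.0])

# calculate several statistics in one go
summary = weights.agg(["min", "max", "mean", "median"])

print(summary)
```

The result is itself a Series, with the function names as its index, so `summary["mean"]` gives you just the mean (130.0 for these made-up numbers).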

### Filtering by column value

Besides selecting certain columns, we can also select certain rows, based on a value in that row. Filtering data by certain values is similar to selecting columns.

We type the name of the DataFrame followed by square brackets and then, instead of inserting a column name, we insert a True/False condition (like those in `if` statements!). For example, to select only rows containing bronze spearheads, we need to filter on value "1" in the "mat" column. So the True/False condition would be `spearheads['mat'] == 1`. The total code looks like:

In [None]:
bronze_spearheads = spearheads[ spearheads['mat'] == 1 ]
bronze_spearheads

Now we have the subset of bronze spearheads, we can calculate statistics on just those rows, just like we did with columns:

In [None]:
bronze_spearheads['weight'].mean()

Now let's compare iron vs. bronze, see which is heavier on average:

In [None]:
iron_spearheads = spearheads[ spearheads['mat'] == 2 ]
iron_spearheads['weight'].mean()

Looks like iron spearheads are heavier on average!
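Filters can also be combined. In Pandas this is done with `&` (and) and `|` (or), and each condition needs its own pair of round brackets. The sketch below uses a made-up DataFrame with invented weights, and an arbitrary threshold of 200:

```python
import pandas as pd

# Made-up data: 'mat' uses the same codes as above (1 = bronze, 2 = iron)
toy = pd.DataFrame({
    "mat": [1, 1, 2, 2],
    "weight": [100.0, 250.0, 300.0, 90.0],
})

# rows that are bronze AND heavier than 200
heavy_bronze = toy[(toy["mat"] == 1) & (toy["weight"] > 200)]

# rows that are bronze OR heavier than 200
bronze_or_heavy = toy[(toy["mat"] == 1) | (toy["weight"] > 200)]

print(heavy_bronze)
print(bronze_or_heavy)
```

Note that Python's normal `and` and `or` keywords do not work here, because the conditions are whole columns of True/False values rather than a single True or False.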

### Grouping by columns

Do you remember pivot tables from Excel? Pandas can do those too! But in Pandas-speak, it's called grouping. The Pandas function `.groupby()` allows us to group data and perform calculations on the groups.

For example, we might want to see the differences between iron and bronze spearheads. Instead of making 2 subsets (1 of iron, 1 of bronze), we can use grouping to make a table to summarise the differences.

The first step to using groupby is to type the name of the DataFrame followed by `.groupby()` with the column we'd like to group on, such as "mat". Once it is grouped, you can use the same functions we used on whole columns/rows like in the previous cells, so using the square brackets with column name(s) and then apply the function. Below, we group the data by "mat" then count how many rows are in each group:

In [None]:
spearheads.groupby("mat")["mat"].count()

Ok, so there are 20 bronze spearheads, and 18 iron ones. Let's double check the weight per material again, but this time using grouping:

In [None]:
spearheads.groupby("mat")["weight"].mean()

That's a lot shorter than manually selecting the columns and calculating the average separately! 

We can also get a statistic for every column, grouped by material. In that case we don't select a column with square brackets, but go straight to the `mean()` function after the `groupby()`:

In [None]:
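# numeric_only=True skips any text columns, which newer pandas versions would otherwise raise an error on
spearheads.groupby("mat").mean(numeric_only=True)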

And just like Excel pivot tables, we can also select 2 columns we'd like to group by. This allows us to compare a statistic (in this case the mean) across 2 variables at the same time:

In [None]:
spearheads.groupby(["mat", "cond"])["weight"].mean()


So we see that for bronze (mat 1), the better the condition (cond), the higher the weight, but this doesn't hold for iron... 
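Grouping can also be combined with `.agg()` to get several statistics per group in one table, which comes even closer to an Excel pivot table. A sketch on a made-up DataFrame (the weights are invented):

```python
import pandas as pd

# Made-up data: 'mat' uses the same codes as above (1 = bronze, 2 = iron)
toy = pd.DataFrame({
    "mat": [1, 1, 2, 2, 2],
    "weight": [100.0, 200.0, 300.0, 150.0, 150.0],
})

# per material: number of rows, mean weight and maximum weight
summary = toy.groupby("mat")["weight"].agg(["count", "mean", "max"])

print(summary)
```

Each row of the result is one group (one material), and each column is one of the requested statistics.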

### Editing DataFrames

So far, we've only created subsets of data and calculated statistics, but the data has stayed the same. Of course, just like in Excel, we can also edit DataFrames. As the column headers aren't very informative (and a bit confusing!), let's update them to be more clear:

In [None]:
spearheads.rename(columns={
    'num': 'number', 
    'mat': 'material', 
    'con': 'context', 
    'loo': 'has_loop', 
    'peg': 'has_peghole', 
    'cond': 'condition', 
    'maxle': 'max_length', 
    'socle': 'socket_length', 
    'maxwi': 'max_width', 
    'upsoc': 'upper_socket_width', 
    'losoc': 'lower_socket_width', 
    'mawit': 'maxwidth_lowersocket_distance'
}, inplace=True)

Instead of listing all the columns in one line of code, they are broken up, 1 per line, which aids readability. Also note the `inplace=True` option: this means the changes are made in the existing DataFrame, instead of returning a changed copy. If we now inspect the DataFrame, we'll see the updated column names:

In [None]:
spearheads.head() # by default, .head() shows the top 5 rows
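To see what happens without `inplace=True`, here is a small sketch on a made-up DataFrame. The original stays as it was, and the renamed version is returned as a new DataFrame that you have to save in a variable yourself:

```python
import pandas as pd

# Tiny made-up DataFrame (values are invented for illustration)
toy = pd.DataFrame({"mat": [1, 2], "cond": [3, 4]})

# without inplace=True, rename() returns a changed copy
renamed = toy.rename(columns={"mat": "material"})

print(list(toy.columns))       # ['mat', 'cond'], the original is unchanged
print(list(renamed.columns))   # ['material', 'cond']
```

Both styles are in common use; saving the result in a new variable has the advantage that you keep the original around in case you need it.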

If we don't need all the columns for our analysis, we can delete some with the `drop()` function:

In [None]:
spearheads.drop(columns='maxwidth_lowersocket_distance', inplace=True)
spearheads.tail() # by default, .tail() shows the last 5 rows

Similarly, we can remove rows by putting the row's index in the `drop()` function. Let's imagine the last row is incorrect, and we want to delete it.

In [None]:
# drop the last row (index of that row is 37)
spearheads.drop(37, inplace=True)

# show updated tail
spearheads.tail()

Another unclear aspect of this data is the use of codes for categories, e.g. using '1' for bronze and '2' for iron. Let's update that using the `replace()` function!

In [None]:
# update category numbers to strings
spearheads['material'] = spearheads['material'].replace(1, 'Bronze')
spearheads['material'] = spearheads['material'].replace(2, 'Iron')

# show random sample to check if it worked
spearheads.sample(5) 

What we did above is take the material column `spearheads['material']`, replace the number by a string `.replace(1, 'Bronze')`, and then save it back in the column by using `spearheads['material'] =`. 
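When there are more than a couple of codes to translate, `replace()` also accepts a dictionary, so all replacements can be done in one call. A sketch on a made-up column:

```python
import pandas as pd

# Made-up column using the same codes as above (1 = bronze, 2 = iron)
toy = pd.DataFrame({"material": [1, 2, 1]})

# the dictionary maps each old value to its new value
toy["material"] = toy["material"].replace({1: "Bronze", 2: "Iron"})

print(toy)
```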

Similarly, we can also combine information from multiple columns and save it in a new column. To create a new column, just type the DataFrame name and put the new column name in square brackets after it, just like you would do when creating a new key/value in a dictionary! For example, if we want to add the length/width ratio (or how elongated it is) of each spearhead, we can do this:

In [None]:
spearheads['length_width_ratio'] = spearheads['max_length'] / spearheads['max_width']
spearheads.head()

Now we have the length/width ratios, let's sort the dataframe on that column, so the most elongated spearheads are at the top:

In [None]:
# sort spearheads by l/w ratio
spearheads.sort_values(by='length_width_ratio', inplace=True, ascending=False) # use ascending=False to reverse the order

# show updated df
spearheads

You can tell the order of the rows has changed by looking at the index, it's now not starting at 0 and in order anymore!
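If you would rather have the index run from 0 upwards again after sorting, `reset_index()` can renumber the rows. A sketch on a made-up DataFrame:

```python
import pandas as pd

# Made-up ratios for illustration
toy = pd.DataFrame({"ratio": [2.0, 5.0, 3.0]})

# sort from high to low; the old index numbers travel with their rows
sorted_toy = toy.sort_values(by="ratio", ascending=False)
print(sorted_toy.index.tolist())   # [1, 2, 0]

# renumber the rows; drop=True throws the old index away instead of keeping it as a column
renumbered = sorted_toy.reset_index(drop=True)
print(renumbered.index.tolist())   # [0, 1, 2]
```

Be careful when renumbering if you still need to refer to rows by their original index, as we did when dropping row 37.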

Another example of updating a column: if we want to update the `has_loop` column from using 1 = no, 2 = yes to using 0 = no, 1 = yes (the default way of storing Boolean true or false data), we can simply subtract 1 from every number in the column:

In [None]:
spearheads['has_loop'] = spearheads['has_loop'] - 1
spearheads.tail()

Mini-exercise!

In the cell below, do the following:

- Update `has_peghole` to use 1s and 0s, like we did with `has_loop`, so the data stays consistent
- Use `head()`, `tail()` or `sample()` to check the data you've updated

In [None]:
## EXERCISE ##


### Write to CSV

We've updated our data, but currently it's only stored in Python's memory. If you were to close this browser tab, the data ceases to exist (although you could very easily recreate it by running this code again, that's the beauty of code!). To save the data permanently, we'll output it to a new CSV file. We can use the `.to_csv()` method with a name for the file in quotation marks.

In addition to a filename, we're also specifying that the Index (the bolded left-most column) is not included in the CSV file.

In [None]:
spearheads.to_csv("updated_spearheads.csv", index=False)

Now check the modules folder where this script is stored, the CSV file should be there. Open it in Excel (or LibreOffice Calc) and look through the data, is everything as you expected it?
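You can also check the saved file from Python itself, by loading it again with `pd.read_csv()`. The sketch below does this round trip with a made-up DataFrame and a temporary folder, so it does not touch your own files:

```python
import os
import tempfile
import pandas as pd

# Made-up data for illustration
toy = pd.DataFrame({"material": ["Bronze", "Iron"], "weight": [100.5, 150.0]})

# write to a CSV file in a temporary folder, without the index
folder = tempfile.mkdtemp()
path = os.path.join(folder, "toy.csv")
toy.to_csv(path, index=False)

# load the file again and inspect it
reloaded = pd.read_csv(path)
print(reloaded)
print(list(reloaded.columns))
```

Because we used `index=False`, the reloaded DataFrame has only the two data columns. Had we left it out, the old index would have come back as an extra, unnamed column.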


## Next module preview: graphs!

In the next module, we'll learn about graphs and plots, here's a sneak preview, showing how easy it is to make these:

In [None]:
spearheads.plot(kind='scatter', x="max_length", y="weight")

## Final Exercise

Time to practice all you've learned in this module! In the cell(s) below, do the following steps. Note you can use as many cells as you want, this can be handy when displaying DataFrames (remember it only displays the last line of code?). Jupyter Notebook automatically adds another cell when you run the last one in the notebook.

- Load the artefacts.csv file from the 'data' folder into a DataFrame and save it as a variable 
- Calculate and display the average number of artefacts found (mean of ART column)
- Calculate and display the sum of all artefacts found (sum of ART column)
- Make a subset of the DataFrame containing only the columns ART and MAT. Save it in a new variable and display it
- Display the subset of rows where the material type (MAT) is ceramic (code '5')
- Get the mean of ART, grouped by MAT
- Change the column headings 'MAT' to 'material' and 'ART' to 'number_of_artefacts'
- Change the material category numbers '1' and '5' to 'flint' and 'ceramic' respectively
- Remove all rows where 'material' is 99 (which stands for 'unknown')
- Sort the DataFrame by YCOORD
- Write the updated DataFrame to a CSV file with a name of your choosing
- Open the CSV file in Excel to check
- BONUS: make a scatterplot of XCOORD and YCOORD


In [None]:
## EXERCISE ##
