# Going down to the cellular level: Indices

So far we have learned how to retrieve a column; now we can learn how to do the same for a row. Remember that, if the data is formatted according to convention, a row encodes a single observation. A single observation, as we have seen, may be described in terms of many variables. Each cell in that row holds the value of the corresponding column's variable.

Rows are numbered, or, as it is called, *indexed*. It is a quirk of computing that row numbers start at 0, not 1. The indices of a dataframe are stored as an attribute of the object and can be found by attaching *.index* to the dataframe variable.
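As a minimal sketch of what *.index* looks like, the snippet below builds a tiny dataframe by hand. The name `toy`, its columns and its numbers are made up for illustration and are not taken from the traffic dataset.

```python
import pandas as pd

# Hypothetical toy data: three observations (rows), two variables (columns)
toy = pd.DataFrame({
    "Year": [2000, 2001, 2002],
    "Cars": [100, 110, 120],
})

# With no index specified, pandas numbers the rows 0, 1, 2
print(toy.index)           # RangeIndex(start=0, stop=3, step=1)
print(toy.index.tolist())  # [0, 1, 2]
```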

| Attribute0 | Attribute1 | Attribute2 |
| ------------- |:-------------:| -----:|
| Row 0, Column 0 | Row 0, Column 1 | Row 0, Column 2 |
| Row 1, Column 0 | Row 1, Column 1 | Row 1, Column 2 |
| Row 2, Column 0 | Row 2, Column 1 | Row 2, Column 2 |
 

In [2]:
import pandas as pd
df = pd.read_csv("traffic_data_glasgow.csv", sep=',')

df.index


Remember that you can refer to a range of integers by using a colon, such as *0:18*. In ordinary Python slicing this covers the integers from 0 up to, but not including, 18.

To find the first two rows we will make use of the following indexers:

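1. loc[]: retrieves rows based on labels for rows and/or columns. A label slice includes its end label, so df.loc[0:1] returns the rows labelled 0 and 1.
2. iloc[]: retrieves rows based on integer positions. A position slice excludes its end, so df.iloc[0:1] returns only the first row and df.iloc[0:2] returns the first two.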
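The difference between label-based and position-based slicing is easy to miss, so here is a small sketch of it. The dataframe `toy` and its values are hypothetical and only serve to make the snippet runnable on its own.

```python
import pandas as pd

# Hypothetical toy data with the default 0, 1, 2, ... index
toy = pd.DataFrame({"Cars": [100, 110, 120, 130]})

by_label = toy.loc[0:1]      # label slice: end label 1 is included
by_position = toy.iloc[0:1]  # position slice: end position 1 is excluded
first_two = toy.iloc[0:2]    # positions 0 and 1

print(len(by_label))     # 2
print(len(by_position))  # 1
print(len(first_two))    # 2
```

With the default index the labels and the positions happen to be the same numbers, which is why the two indexers are easy to confuse.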

What's the index in our particular case? Let's check:

In [None]:
print(df.loc[0:1])   # label slice: rows labelled 0 and 1
print("--------------------------")
print(df.iloc[0:2])  # position slice: positions 0 and 1 (the end is excluded)

Dealing with row numbers feels a bit inconvenient, however. It would be more natural to refer to rows using a unique identifier attribute, such as id or year. That way we could get all the rows encoding data between 2007 and 2010 without having to convert between row numbers and years.
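To make the idea concrete, here is a minimal sketch on a hand-made dataframe. The name `toy` and all of its values are invented for illustration; only the `Year` column name mirrors the traffic data.

```python
import pandas as pd

# Hypothetical toy data; the numbers are invented for illustration
toy = pd.DataFrame({
    "Year": [2006, 2007, 2008, 2009, 2010, 2011],
    "Cars": [100, 110, 120, 130, 140, 150],
})

# Use the Year column as the row labels
toy = toy.set_index("Year")

# Label slices include both ends, so this returns 2007, 2008, 2009 and 2010
subset = toy.loc[2007:2010]
print(subset.index.tolist())  # [2007, 2008, 2009, 2010]
```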


In [3]:
#set the index to be equal to the year column
df = df.set_index('Year')
print (df.head())


# Your turn

1. Get rows between 2010 and 2012 (including 2012)
2. In addition to getting the entries between 2010 and 2012, include only the columns from Pedal Cycles to Cars
3. Find the min number of motorcycles between 2000 and 2005 (Hint: you should get 4174 as the min number of motorcycles between 2000 and 2005)

In [4]:
# entries between 2010 and 2012
print (df.FUNCTION[row_range])

# entries between 2010 and 2012, columns from Pedal Cycles to Cars
print (df.FUNCTION[row_range,column_range])

# min number of motorcycles between 2000 and 2005
print (df.loc[row_range, column].FUNCTION)


# Ignoring the irrelevant: filters
In our analysis it is often useful to know which entries satisfy a condition. For example, we might want to look only at the years with more than 120 count points.

How do we do that? Between the square brackets we need to put an expression that states the condition the value in the cell must satisfy. This is done using the comparison operators ==, <, >, <=, >= and !=.
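The two steps involved, building the True/False condition and then using it to select rows, can be sketched on a small hand-made dataframe. The name `toy` and its values are hypothetical; only the `Count points` column name is borrowed from the traffic data.

```python
import pandas as pd

# Hypothetical toy data; the numbers are invented for illustration
toy = pd.DataFrame(
    {"Count points": [118, 121, 125]},
    index=[2000, 2001, 2002],
)

# The comparison is applied to every cell and yields a True/False Series
mask = toy["Count points"] > 120
print(mask.tolist())  # [False, True, True]

# Placing the mask between square brackets keeps only the True rows
filtered = toy[mask]
print(filtered.index.tolist())  # [2001, 2002]
```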


In [5]:
dfCounting = df[df['Count points']>120]
# breaking it down:
# df['Count points']>120 returns True/False values depending on whether the condition has been satisfied
# However, we want to see which rows had more than 120 count points and what values were recorded; pandas does that for us when the condition is placed inside df[...]
print (dfCounting.head())


# Your turn

1. Retrieve the entries for which the count points are exactly 118 (HINT: use == for your condition)
2. Retrieve the entries for which the number of motorcycles is less than 4400

In [6]:
#retrieving entries with exactly 118 count points
dfPoints = df[df[COLUMN_NAME]CONDITION]

#entries with less than 4400 motorcycles counted
dfMotorcycles = ...

print(dfPoints.head())


# Time to visualise: Line graphs

It is hard, if not impossible, to make a sensible visualisation out of data from 10 different columns, so as we begin to visualise the data, we initially settle for one variable, for example time. The traffic data we deal with is a *time series*, where data points are indexed in time order at equally spaced intervals. Line charts are one of the most common ways of visualising time series. You have probably seen them in the news or in a PowerPoint presentation not long ago, as humans are interested in understanding the past and using it to predict the future.

To do this we must first import a library specialised for making plots, called **matplotlib**. The exact syntax for stringing a line graph together can be overwhelming at first, but it is easy enough to tweak.



In [7]:
import numpy as np
import matplotlib.pyplot as plt #will use it to create charts
import warnings; warnings.simplefilter('ignore')

#X axis: plotting the time range
x = df.index
# y: number of vehicles
y = df['All Motor Vehicles']
y1 = df['Cars']
plt.plot(x,y, label='all')
plt.plot(x,y1,label='cars')

plt.xticks(np.arange(min(df.index), max(df.index)+1, 4))
#code to set title and labels
plt.title("Traffic measurements between 2000 and 2017")
plt.xlabel('Year')
plt.ylabel("Count")
plt.legend(loc='upper left')
plt.show()



# Your turn

Create a line chart for 2 fields of your choice. You can also select a year range if you prefer.

In [1]:
#X axis: plotting the time range
x = df.index
# y: number of vehicles
y = df[COLUMN NAME]
y1 = df[COLUMN NAME]
plt.plot(x,y, label= SENSIBLE LEGEND LABEL)
plt.plot(x,y1,label= SENSIBLE LEGEND LABEL)
plt.xticks(np.arange(min(df.index), max(df.index)+1, 4))

#code to set title and labels (see previous example)
plt.title(...)
plt.
...
plt.show()
