## Pandas and CSV/Excel

#### Libraries - Imports

In [1]:
# pip install pandas

In [2]:
# Importing pandas under the alias pd is a convention in the Python community.

In [3]:
import os
import pandas as pd #conventional

#### Read

Download the CSV of [Roman Amphitheaters](https://github.com/roman-amphitheaters/roman-amphitheaters/blob/main/roman-amphitheaters.csv) and save it in the same folder as this notebook.

In [4]:
# df stands for dataframe, i.e., rows and columns of data.

In [5]:
df = pd.read_csv("roman-amphitheaters.csv")
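If the file is not at hand, the same call can be tried on an in-memory string. This is a minimal sketch: the three-row CSV below is invented for illustration (it is not taken from the amphitheater dataset), and `io.StringIO` stands in for the file.

```python
import io
import pandas as pd

# Hypothetical miniature CSV; the values are made up for illustration.
csv_text = """id,label,capacity
a1,Alpha,1000
b2,Beta,
c3,Gamma,2500
"""

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample)
print(sample.dtypes)  # capacity is read as float64 because one value is missing
```

The empty `capacity` field becomes `NaN`, which is why that column is read as a float rather than an integer.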

In [6]:
# df

In [7]:
df.head(10)  # first 10 rows (not displayed: a cell only shows its last expression)
df.tail(5)  # last 5 rows

Unnamed: 0,id,title,label,latintoponym,pleiades,welchid,golvinid,buildingtype,chronogroup,secondcentury,...,amphitheatrum,dimensionsunknown,arenamajor,arenaminor,extmajor,extminor,exteriorheight,longitude,latitude,elevation
270,saintGeorgesDuBoisAmphitheater,Amphitheater at Saint-Georges-du-Bois,Saint-Georges-du-Bois,,,,103.0,amphitheater,first-century,True,...,https://amphi-theatrum.de/1529.html,False,54.0,30.0,65.0,50.0,,-0.749919,46.142723,39
271,toledoAmphitheater,Amphitheater at Toledo,Toledo,Toletum,https://pleiades.stoa.org/places/266066,,,amphitheater,imperial,True,...,https://amphi-theatrum.de/3090.html,True,,,,,,-4.022888,39.865349,482
272,kaiseraugustAmphitheater,Amphitheater at Kaiseraugst,Kaiseraugst,Castrum Rauracense,https://pleiades.stoa.org/places/81716458,,,amphitheater,fourth-century,False,...,https://amphi-theatrum.de/3066.html,False,,,50.0,40.0,,7.721596,47.540822,482
273,ammaiaAmphitheater,Amphitheater at Ammaia,Ammaia,Ammaia,https://pleiades.stoa.org/places/255975,,,amphitheater,imperial,True,...,https://amphi-theatrum.de/3020.html,False,,,60.0,,,-7.39197,39.369905,566
274,contributaAmphitheater,Amphitheater at Contributa Iulia Ugultunia,Contributa,Contributa Iulia Ugultunia,https://pleiades.stoa.org/places/256126,,,amphitheater,imperial,True,...,https://amphi-theatrum.de/3093.html,False,,,72.0,,,-6.38932,38.347751,501
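The output above contains only the last five rows because a notebook cell displays just the value of its final expression. A small sketch of one way to see both results from a single cell, using a made-up 20-row dataframe (`sample` is a hypothetical name):

```python
import pandas as pd

sample = pd.DataFrame({"n": range(20)})  # hypothetical data: 20 rows

first_rows = sample.head(10)  # first 10 rows
last_rows = sample.tail(5)    # last 5 rows

# Printing each result explicitly shows both, even inside one cell.
print(first_rows)
print(last_rows)
```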


#### Data types in our csv

In [8]:
df.shape # notice that this is an attribute so no parentheses

(275, 24)
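Because `shape` is a plain tuple, it can be unpacked directly. The sketch below also shows two related calls for inspecting a dataframe, `dtypes` (an attribute) and `isna().sum()` (methods), on a small invented dataframe whose column names only mimic the real file.

```python
import pandas as pd

# Hypothetical data; column names mimic the amphitheater file.
sample = pd.DataFrame({
    "label": ["Alpha", "Beta", "Gamma"],
    "capacity": [1000.0, None, 2500.0],
    "secondcentury": [True, False, True],
})

n_rows, n_cols = sample.shape   # attribute: a (rows, columns) tuple
missing = sample.isna().sum()   # methods: number of missing values per column

print(n_rows, n_cols)
print(sample.dtypes)            # attribute: data type of each column
print(missing)
```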

In [9]:
df.info() # notice that this is a method, so we need parentheses

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 275 entries, 0 to 274
Data columns (total 24 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 275 non-null    object 
 1   title              275 non-null    object 
 2   label              275 non-null    object 
 3   latintoponym       261 non-null    object 
 4   pleiades           274 non-null    object 
 5   welchid            19 non-null     float64
 6   golvinid           166 non-null    float64
 7   buildingtype       275 non-null    object 
 8   chronogroup        275 non-null    object 
 9   secondcentury      275 non-null    bool   
 10  capacity           153 non-null    float64
 11  modcountry         275 non-null    object 
 12  romanregion        274 non-null    object 
 13  zotero             56 non-null     object 
 14  amphitheatrum      239 non-null    object 
 15  dimensionsunknown  275 non-null    bool   
 16  arenamajor         184 non

In [10]:
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
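`pd.set_option` changes the display settings for the rest of the session. When the wider display is only needed for one output, `pd.option_context` is an alternative that restores the previous value afterwards. A minimal sketch on made-up data:

```python
import pandas as pd

sample = pd.DataFrame({"n": range(70)})  # hypothetical data: 70 rows

before = pd.get_option("display.max_rows")

# Inside the with-block the option is changed; afterwards it is restored.
with pd.option_context("display.max_rows", 500):
    inside = pd.get_option("display.max_rows")
    print(sample)  # printed without truncation

after = pd.get_option("display.max_rows")
print(before, inside, after)
```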

### Example: Shipwrecks Database

Download the [Shipwrecks Database](http://oxrep.classics.ox.ac.uk/docs/StraussShipwrecks.zip) from [The Oxford Roman Economy Project](http://oxrep.classics.ox.ac.uk/) and unzip it.

In [11]:
df = pd.read_excel("StraussShipwrecks.xlsx")

In [12]:
# df

#### Access one column

In [13]:
df['Sea area']

0       Adriatic
1       Adriatic
2       Adriatic
3       Adriatic
4       Adriatic
          ...   
1779         NaN
1780         NaN
1781         NaN
1782         NaN
1783         NaN
Name: Sea area, Length: 1784, dtype: object

#### What datatype is this?

In [14]:
type(df['Sea area'])

pandas.core.series.Series

#### Access two columns

In [15]:
df[['Name','Sea area']]

Unnamed: 0,Name,Sea area
0,Komiza,Adriatic
1,Lokunji,Adriatic
2,Maharac Cape,Adriatic
3,Mlin,Adriatic
4,Plavac B,Adriatic
...,...,...
1779,Olbia Late Roman 2,
1780,Olbia medieval,
1781,Isola della Gallinara (Savona),
1782,San Vito Lo Capo,


In [16]:
df[['Wreck ID','Name','Sea area']]

Unnamed: 0,Wreck ID,Name,Sea area
0,1,Komiza,Adriatic
1,2,Lokunji,Adriatic
2,3,Maharac Cape,Adriatic
3,4,Mlin,Adriatic
4,5,Plavac B,Adriatic
...,...,...,...
1779,9058,Olbia Late Roman 2,
1780,9059,Olbia medieval,
1781,9060,Isola della Gallinara (Savona),
1782,9061,San Vito Lo Capo,


For more on the Series data type, see the [W3Schools page on pandas Series](https://www.w3schools.com/python/pandas/pandas_series.asp).

#### What datatype is this now?

In [17]:
type(df[['Name','Sea area']])

pandas.core.frame.DataFrame
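The difference comes from the brackets, not from the number of columns: a single column name returns a Series, while a list of names returns a DataFrame, even if the list has only one entry. A short sketch on invented rows:

```python
import pandas as pd

# Hypothetical rows; column names follow the shipwrecks file.
sample = pd.DataFrame({
    "Name": ["Wreck A", "Wreck B"],
    "Sea area": ["Adriatic", "Adriatic"],
})

one_col_series = sample["Sea area"]    # single brackets -> Series
one_col_frame = sample[["Sea area"]]   # list with one name -> DataFrame

print(type(one_col_series).__name__)
print(type(one_col_frame).__name__)
```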

#### Get all columns

In [18]:
# df

### Get rows
- `iloc` (integer location): selects rows by position
- `loc`: selects rows by index label

### 1) iloc

In [19]:
df.iloc[0] #gives the first row

Wreck ID                                                                   1
Strauss ID                                                               331
Name                                                                  Komiza
Parker Number                                                            NaN
Sea area                                                            Adriatic
Country                                                              Croatia
Region                                                            Vis Island
Latitude                                                            43.03333
Longitude                                                           16.08333
Min depth                                                               30.0
Max depth                                                               30.0
Depth                                                                    30m
Period                                                        Roman Imperial

#### How can we get the first two rows?

In [20]:
df.iloc[0:2] #gets the first two rows

Unnamed: 0,Wreck ID,Strauss ID,Name,Parker Number,Sea area,Country,Region,Latitude,Longitude,Min depth,Max depth,Depth,Period,Dating,Earliest date,Latest date,Date range,Mid point of date range,Probability,Place of origin,Place of destination,Reference,Comments,Amphorae,Marble,Columns etc,Sarcophagi,Blocks,Marble type,Other cargo,Hull remains,Shipboard paraphernalia,Ship equipment,Estimated tonnage,Amphora type
0,1,331,Komiza,,Adriatic,Croatia,Vis Island,43.03333,16.08333,30.0,30.0,30m,Roman Imperial,C1st-2nd AD,1.0,200.0,,,,Egypt,Northern Italy,"M. Jurišić, Ancient Shipwrecks of the Adriatic...",Completely looted site with sherds of Egyptian...,True,False,False,False,False,,,,,,,
1,2,328,Lokunji,,Adriatic,Croatia,Kvarner region,44.7,14.28333,4.0,12.0,12m,Roman Imperial,C 1st AD,1.0,100.0,,,,Cos,Northern Italy,"D. Vrsalović, Istraživanja i Zaštita Podmorški...",,True,False,False,False,False,,,,,,,


#### How can we get one column of the above df?

In [21]:
df.iloc[0:2]['Sea area'] #gets the 'Sea area' column for the first two rows

0    Adriatic
1    Adriatic
Name: Sea area, dtype: object
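Rows and columns can also be selected in a single call by passing both to `iloc` (positions) or `loc` (labels). The sketch below uses a small invented dataframe. Note that a `loc` slice includes its end label, while an `iloc` slice excludes its end position.

```python
import pandas as pd

# Hypothetical rows; column names follow the shipwrecks file.
sample = pd.DataFrame({
    "Name": ["Wreck A", "Wreck B", "Wreck C"],
    "Sea area": ["Adriatic", "Adriatic", "Aegean"],
    "Country": ["Croatia", "Croatia", "Greece"],
})

by_position = sample.iloc[0:2, 1]        # rows at positions 0 and 1, column at position 1
by_label = sample.loc[0:1, "Sea area"]   # rows labelled 0 through 1 (inclusive), column by name

print(by_position)
print(by_label)
```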

### 2) loc
`loc` does not select by position (first row, second row, etc.). It returns the row whose index label matches the value we specify.

In [22]:
df.loc[15] #gets the row whose index label is 15 (the 16th row by position)

Wreck ID                                                                  16
Strauss ID                                                               503
Name                                                                  Tyre F
Parker Number                                                           1189
Sea area                                               Eastern Mediterranean
Country                                                              Lebanon
Region                                             On the south side of Tyre
Latitude                                                                 NaN
Longitude                                                                NaN
Min depth                                                                NaN
Max depth                                                                NaN
Depth                                                                Shallow
Period                                                    Hellenistic/ Roman
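With the default index (0, 1, 2, ...), labels and positions coincide, so `loc` and `iloc` look interchangeable. The difference shows up once the index is something else. A minimal sketch with an invented index of 10, 20, 30:

```python
import pandas as pd

# Hypothetical data with index labels that differ from the row positions.
sample = pd.DataFrame({"Name": ["A", "B", "C"]}, index=[10, 20, 30])

first_by_position = sample.iloc[0]   # first row, whatever its label is
first_by_label = sample.loc[10]      # row whose index label is 10

# Label 0 does not exist in this index, so loc raises a KeyError.
try:
    sample.loc[0]
    label_zero_found = True
except KeyError:
    label_zero_found = False

print(first_by_position["Name"], first_by_label["Name"], label_zero_found)
```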

### Let's grab some data

#### Get the Country column

In [23]:
df['Country'] #gets the entire country column

0       Croatia
1       Croatia
2       Croatia
3       Croatia
4       Croatia
         ...   
1779      Italy
1780      Italy
1781      Italy
1782      Italy
1783      Italy
Name: Country, Length: 1784, dtype: object

#### Count with `.value_counts()`

In [24]:
df['Country'].value_counts() #counts how many times each unique value appears

Italy                   247
France                  217
Greece                   94
ZZ-Non-Mediterranean     69
Croatia                  63
Turkey                   62
Israel                   56
Spain                    30
Bulgaria                 18
Cyprus                   13
Albania                  11
International waters      9
Libya                     6
Lebanon                   5
Egypt                     5
Tunisia                   2
Montenegro                2
India                     1
Romania                   1
Minorca                   1
Syria                     1
Italy - Sicily            1
Malta                     1
Sudan                     1
Name: Country, dtype: int64
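By default `value_counts()` leaves out missing values, which is worth keeping in mind when the counts do not add up to the number of rows. Two optional parameters, `dropna` and `normalize`, change that behaviour. A sketch on a small invented Series:

```python
import pandas as pd

# Hypothetical values, including one missing entry.
countries = pd.Series(["Italy", "Italy", "France", None])

counts = countries.value_counts()                   # missing values are excluded
with_missing = countries.value_counts(dropna=False) # missing values get their own row
shares = countries.value_counts(normalize=True)     # proportions instead of counts

print(counts)
print(with_missing)
print(shares)
```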

### Set index `.set_index()`

In [25]:
# makes the given column the index (the row labels); the rows are not sorted
# (until now the index was the automatically assigned numbering 0, 1, 2, ...)
df.set_index('Country')

Unnamed: 0_level_0,Wreck ID,Strauss ID,Name,Parker Number,Sea area,Region,Latitude,Longitude,Min depth,Max depth,Depth,Period,Dating,Earliest date,Latest date,Date range,Mid point of date range,Probability,Place of origin,Place of destination,Reference,Comments,Amphorae,Marble,Columns etc,Sarcophagi,Blocks,Marble type,Other cargo,Hull remains,Shipboard paraphernalia,Ship equipment,Estimated tonnage,Amphora type
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1
Croatia,1,331,Komiza,,Adriatic,Vis Island,43.03333,16.08333,30.0,30.0,30m,Roman Imperial,C1st-2nd AD,1.0,200.0,,,,Egypt,Northern Italy,"M. Jurišić, Ancient Shipwrecks of the Adriatic...",Completely looted site with sherds of Egyptian...,True,False,False,False,False,,,,,,,
Croatia,2,328,Lokunji,,Adriatic,Kvarner region,44.70000,14.28333,4.0,12.0,12m,Roman Imperial,C 1st AD,1.0,100.0,,,,Cos,Northern Italy,"D. Vrsalović, Istraživanja i Zaštita Podmorški...",,True,False,False,False,False,,,,,,,
Croatia,3,329,Maharac Cape,,Adriatic,Mljet island,42.73333,17.66666,3.0,20.0,3-20m,Roman Imperial,C 1st AD,1.0,100.0,,,,Rhodes,Northern Italy,"D. Vrsalović, PhD thesis (unpublished; 1979), ...",,True,False,False,False,False,,Eastern coarse ware pottery of biconical dishe...,,,,,
Croatia,4,330,Mlin,702,Adriatic,Split channel,43.45000,16.23333,25.0,40.0,40m,Roman Imperial,C 1st-2nd AD,1.0,200.0,,,,Aegean,Northern Italy,"D. Vrsalović, Istraživanja i Zaštita Podmorški...",A heavily looted mixed cargo of Greek amphorae...,True,False,False,False,False,,,Remains of the hull.,,"Two lead anchor stocks, though possibly not fr...",,
Croatia,5,322,Plavac B,832,Adriatic,"Zlarin Island, central Dalmatia",,,,,,Roman Imperial,C1st AD,1.0,100.0,,,,,,,,False,False,False,False,False,,,,,.,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Italy,9058,0,Olbia Late Roman 2,,,Sardinia,40.91148,9.50738,0.0,0.0,0m,Late Roman,C5th AD,400.0,500.0,0.0,0.0,0.0,,,"R. D'Oriano, Archeologia in Sardegna, 2007, 9...",Three other C5th AD wrecks found during constr...,False,False,False,False,False,,,,,,0.0,
Italy,9059,0,Olbia medieval,,,Sardinia,40.91148,9.50738,0.0,0.0,0m,Medieval,C11th-15th AD,1000.0,1500.0,0.0,0.0,0.0,,,"R. D'Oriano, Archeologia in Sardegna, 2007, 9...","Five small wrecks from the eleventh, fourteent...",False,False,False,False,False,,,,,,0.0,
Italy,9060,0,Isola della Gallinara (Savona),,,Liguria,44.02648,8.22996,50.0,50.0,50m,Roman Republic,C1st BC,-100.0,-1.0,0.0,0.0,0.0,Campania,,http://www.archeosub.it/news2003/news0311.htm#...,A mound of intact Dr 1 amphorae and 'other car...,True,False,False,False,False,,,,,,0.0,
Italy,9061,0,San Vito Lo Capo,,,Sicily,38.18825,12.74551,0.0,0.0,,Norman,C12th AD,1100.0,1200.0,0.0,0.0,0.0,,,"F. Faccenna, Il Relitto di San Vito Lo Capo (2...",A local Sicilian ship carrying amphorae and ja...,True,False,False,False,False,,,,,,0.0,


#### Did the original dataframe change?

In [26]:
# no: set_index returns a new dataframe and leaves df unchanged
# df

#### Let's pass the parameter `inplace=True`

In [27]:
df.set_index('Country', inplace=True)

#### What will happen if I look for `df.loc[0]`?

In [28]:
df.loc[0] #there is no longer an index label 0; the labels are now country names

KeyError: 0
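Once a column has become the index, `loc` expects values from that column as labels. The sketch below, on invented rows, also shows the alternative to `inplace=True`: assigning the returned dataframe to a new name (`indexed` is a hypothetical name) so the original stays available.

```python
import pandas as pd

# Hypothetical rows; column names follow the shipwrecks file.
sample = pd.DataFrame({
    "Country": ["Greece", "Italy", "Greece"],
    "Name": ["Wreck A", "Wreck B", "Wreck C"],
})

indexed = sample.set_index("Country")  # new dataframe; sample is unchanged
greek = indexed.loc["Greece"]          # all rows whose index label is 'Greece'

print(greek)
print("Country" in sample.columns)     # the original still has the column
```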

#### Let's reset the index

In [None]:
df.reset_index(inplace = True)

In [None]:
# df

### Sort `.sort_index(ascending=False)`

In [None]:
df.sort_index(ascending=False)

In [None]:
# Without inplace=True this does not change the dataframe; it only returns a sorted copy to look at.
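`sort_index` orders rows by their index labels. To order by the values of a column instead, pandas provides `sort_values`. A minimal sketch on invented rows (the column name `Min depth` follows the shipwrecks file, the numbers are made up):

```python
import pandas as pd

# Hypothetical rows for illustration.
sample = pd.DataFrame({
    "Name": ["Wreck B", "Wreck C", "Wreck A"],
    "Min depth": [12.0, 30.0, 4.0],
})

by_index_desc = sample.sort_index(ascending=False)  # index labels 2, 1, 0
by_depth = sample.sort_values("Min depth")          # shallowest first

print(by_index_desc)
print(by_depth)
print(sample)  # unchanged: both calls returned new dataframes
```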

## Filter

#### What's the result of the below statement?

In [None]:
df['Country'] == 'Greece' #returns a boolean Series: True where Country is 'Greece', False elsewhere
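The comparison produces one `True`/`False` value per row, often called a mask. Since `True` counts as 1, summing the mask gives the number of matching rows. A short sketch on invented rows:

```python
import pandas as pd

# Hypothetical rows for illustration.
sample = pd.DataFrame({"Country": ["Greece", "Italy", "Greece"]})

mask = sample["Country"] == "Greece"  # boolean Series, one value per row
n_greek = mask.sum()                  # True counts as 1, False as 0

print(mask)
print(n_greek)
```

Passing such a mask inside `sample[...]` keeps only the rows where it is `True`.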

In [None]:
df[(df['Country'] == 'Greece')] #filters for Greece

In [None]:
df[(df['Wreck ID'] == 100)] #filters for Wreck ID = 100 ---- note the data type!

In [None]:
df[(df['Country'] == 'Greece') & (df['Period'] == 'Roman Imperial')] #filters for Greece & Roman Imperial

#### What about if we want everything but Greece?

In [None]:
df[~(df['Country'] == 'Greece')] #filters for not Greece (~ inverts the boolean values)
# df[df['Country'] != 'Greece'] does the same thing

In [None]:
df[~(df['Wreck ID'] > 100)] #filters for Wreck ID not greater than 100
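The `~` operator flips every value of a mask, so a negated comparison can usually be rewritten with the opposite operator. The sketch below checks this on invented rows. It assumes the column has no missing values: for a missing number, `~(x > 100)` is `True` while `x <= 100` is `False`.

```python
import pandas as pd

# Hypothetical rows without missing values.
sample = pd.DataFrame({
    "Wreck ID": [50, 100, 150],
    "Country": ["Greece", "Italy", "Greece"],
})

not_greece_a = sample[~(sample["Country"] == "Greece")]
not_greece_b = sample[sample["Country"] != "Greece"]

not_above_100 = sample[~(sample["Wreck ID"] > 100)]
at_most_100 = sample[sample["Wreck ID"] <= 100]

print(not_greece_a.equals(not_greece_b))
print(not_above_100.equals(at_most_100))
```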

#### Combine two filters with `&` = `and`

In [None]:
df[(df['Country'] == 'Greece') & (df['Period'] == 'Roman Imperial')] #filters for Greece & Roman Imperial

#### Combine two filters with `|` = `or`

In [None]:
df[(df['Country'] == 'Greece') | (df['Country'] == 'Italy')] #filters for Greece or Italy
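The parentheses around each condition matter, because `&` and `|` bind more tightly than `==` in Python. When several values of the same column are chained with `|`, the `isin` method is a shorter alternative. A sketch on invented rows:

```python
import pandas as pd

# Hypothetical rows for illustration.
sample = pd.DataFrame({"Country": ["Greece", "Italy", "France"]})

with_or = sample[(sample["Country"] == "Greece") | (sample["Country"] == "Italy")]
with_isin = sample[sample["Country"].isin(["Greece", "Italy"])]

print(with_or)
print(with_or.equals(with_isin))
```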