# Ewan Jones - Feeder 3.1

## Instructions Overview

The following instructions will guide you through compiling a new subset of data from a publicly available dataset, all using Python 3.
- This is designed for someone with little to no experience with Anaconda, Python 3, or even data in general, though some amount of common sense is certainly required.

The overall process will proceed as follows: 
1. Set up a **project directory**, open **Anaconda** and **JupyterLab**
2. Import necessary **packages**
3. Read in a `.csv` file to create a **dataframe**
4. **Filter** to create a **subset** of that dataframe that only contains relevant information
5. Export new **subset** as another `.csv`

## Getting Started

First, create a folder where you will keep all the files involved in this process. 
- Download the `.csv` file named `EarthquakeData.csv` and move it to this directory.
- Also make sure this **.ipynb** file is in that folder.

Open **Anaconda Navigator**. Open **JupyterLab** by clicking on its "Launch" button.
- If you are reading this in JupyterLab, you have already done this step.

In the left pane of the window that opens in your browser, navigate to the designated directory that you made.
- *Note that the file browser may not update right away, so if you created this folder after opening JupyterLab and it does not appear, refresh the file browser or quit and restart JupyterLab.*

To follow along in your own **.ipynb** file, click the blue plus button in the top-left corner of the **JupyterLab** interface.
- Select **Python 3 (ipykernel)** from below **Notebook**
- This file may be renamed within the left pane by right-clicking on it and selecting **Rename**

### Packages

To properly work with the dataset within Python3, it is also necessary to import a few **packages** that will assist us in creating a **dataframe** from our data.

The first of these is called **"numpy"**, which can be imported with the following command, entered in the right pane of JupyterLab:

In [13]:
import numpy as np

The second is called **"pandas"**, which can be imported with a similar command:

In [14]:
import pandas as pd

We make sure not just to *import* these packages, but to also give them nicknames (`np` and `pd`) since this will make referencing them much quicker in the future.
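As a quick check that both imports worked, the short sketch below prints the installed version of each package through its nickname. It assumes only that numpy and pandas are installed, as they are in a standard Anaconda setup; the version numbers you see will depend on your installation.

```python
import numpy as np
import pandas as pd

# If either import failed, this cell would stop with an error before printing.
print("numpy version:", np.__version__)
print("pandas version:", pd.__version__)
```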

## Creating a Dataframe

The next step in the process is creating a dataframe. To do this, we will use a function called `pd.read_csv()`.

### Using `pd.read_csv()`

We can define the dataframe we are creating as `df` by setting it equal to the function `pd.read_csv()` and identifying the `.csv` file we want to use within the parentheses, in quotation marks. Giving only the file name works because this notebook and the `EarthquakeData.csv` file are in the same folder.
- The following command executes this step:



In [15]:
df = pd.read_csv("EarthquakeData.csv")
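Once a dataframe exists, it can help to confirm its size and column names before going further. The sketch below is only an illustration: so that it runs without the real file, it reads a small stand-in (the hypothetical variable `csv_text`) holding three rows and six columns copied from sample output shown in this notebook. With the real file you would keep `pd.read_csv("EarthquakeData.csv")` and run the same three checks.

```python
import io
import pandas as pd

# Stand-in for EarthquakeData.csv (an assumption for illustration only):
# three rows and six columns copied from sample output in this notebook.
csv_text = io.StringIO(
    "time,latitude,longitude,depth,mag,place\n"
    "2021-11-30T05:03:37.524Z,-60.2774,-25.4681,10.0,4.6,South Sandwich Islands region\n"
    "2021-12-08T22:21:38.559Z,44.0143,-129.242,10.0,4.6,off the coast of Oregon\n"
    '2021-12-30T10:20:37.091Z,-9.2556,161.2634,10.0,4.6,"82 km SE of Auki, Solomon Islands"\n'
)
df = pd.read_csv(csv_text)

print(df.shape)          # (rows, columns); (3, 6) for this stand-in
print(list(df.columns))  # the column names as a list of strings
print(df.head(2))        # the first two rows, in file order
```

Note that the place name containing a comma is wrapped in double quotes in the text, which is how `pd.read_csv()` knows to treat it as a single value.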

To see what a random row of this dataframe looks like, we can use the function `.sample()`:

In [47]:
df.sample()

| | time | latitude | longitude | depth | mag | magType | nst | gap | dmin | rms | ... | updated | place | type | horizontalError | depthError | magError | magNst | status | locationSource | magSource |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1028 | 2021-11-30T05:03:37.524Z | -60.2774 | -25.4681 | 10.0 | 4.6 | mb | NaN | 81.0 | 14.539 | 0.46 | ... | 2022-02-05T23:03:57.040Z | South Sandwich Islands region | earthquake | 7.7 | 1.9 | 0.082 | 44.0 | reviewed | us | us |


Leaving the parentheses empty returns one random row; putting a number in the parentheses requests that many rows:

In [48]:
df.sample(4)

| | time | latitude | longitude | depth | mag | magType | nst | gap | dmin | rms | ... | updated | place | type | horizontalError | depthError | magError | magNst | status | locationSource | magSource |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 11762 | 2022-07-07T14:53:40.689Z | -22.317 | -177.8858 | 321.7 | 4.6 | mb | 48.0 | 125.0 | 6.916 | 1.02 | ... | 2022-09-13T06:23:34.040Z | south of the Fiji Islands | earthquake | 15.0 | 8.1 | 0.047 | 134.0 | reviewed | us | us |
| 1580 | 2021-12-08T22:21:38.559Z | 44.0143 | -129.242 | 10.0 | 4.6 | mb | NaN | 187.0 | 1.839 | 0.66 | ... | 2022-02-12T23:07:02.040Z | off the coast of Oregon | earthquake | 3.8 | 1.9 | 0.057 | 93.0 | reviewed | us | us |
| 5800 | 2022-03-02T17:02:35.235Z | 49.4744 | 155.7967 | 64.95 | 4.9 | mb | NaN | 80.0 | 3.829 | 0.68 | ... | 2022-05-07T21:32:43.040Z | 135 km S of Severo-Kuril’sk, Russia | earthquake | 9.0 | 5.6 | 0.035 | 250.0 | reviewed | us | us |
| 3490 | 2022-01-16T19:29:37.336Z | -59.1201 | -25.9951 | 50.05 | 4.4 | mb | NaN | 54.0 | 7.535 | 0.89 | ... | 2022-04-01T17:40:31.040Z | South Sandwich Islands region | earthquake | 8.8 | 9.0 | 0.117 | 21.0 | reviewed | us | us |
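Because `.sample()` picks rows at random, running it again will usually show different rows. If you want the same rows every time (for example, to compare notes with someone else), pandas accepts a `random_state` argument. The sketch below uses a small stand-in dataframe with values copied from the output above; the seed value `0` is an arbitrary choice.

```python
import pandas as pd

# Small stand-in dataframe (values copied from sample output in this notebook).
df = pd.DataFrame({
    "latitude": [-60.2774, -22.317, 44.0143, 49.4744, -59.1201],
    "mag": [4.6, 4.6, 4.6, 4.9, 4.4],
})

# Using the same random_state makes the "random" choice repeatable.
first = df.sample(3, random_state=0)
second = df.sample(3, random_state=0)

print(first.equals(second))  # True: both calls picked the same three rows
```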


## Creating a subset of our dataset

While we could just keep the `df` dataframe we just made, it contains a lot of information that is not relevant to the research questions we might want to ask. For the purposes of this exercise, the subset we end up creating will include only earthquakes from the southern hemisphere and only the following variables (columns):
- **Time** (which includes both date and 24-hour time)
- **Latitude** and **Longitude** (very helpful for future visualizations)
- **Depth**
- **Mag** or Magnitude
- **Place** (we include this as a readable indication of location, though the lat/long values are more computationally significant)

### Defining a new dataframe

To create our subset, we will begin by defining a new, intermediate dataframe, `Earthquake_south`, that contains only the rows of our original dataframe, `df`, whose **latitude** value is less than 0 (the latitude of the equator). This keeps only earthquakes in the southern hemisphere.

- The following command defines our new dataframe:

In [26]:
Earthquake_south = df[df["latitude"] < 0]

To see how this has filtered our previous dataframe, we can once again use the function `.sample()`:

In [55]:
Earthquake_south.sample(4)

| | time | latitude | longitude | depth | mag | magType | nst | gap | dmin | rms | ... | updated | place | type | horizontalError | depthError | magError | magNst | status | locationSource | magSource |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4787 | 2022-02-09T19:27:57.539Z | -56.4952 | -26.7052 | 107.33 | 4.5 | mb | NaN | 105.0 | 5.999 | 0.25 | ... | 2022-04-19T17:40:19.040Z | South Sandwich Islands region | earthquake | 14.2 | 7.4 | 0.156 | 12.0 | reviewed | us | us |
| 9369 | 2022-05-13T04:00:02.213Z | -24.442 | -179.9574 | 520.45 | 4.3 | mb | 19.0 | 115.0 | 10.228 | 0.66 | ... | 2022-07-22T18:31:00.040Z | south of the Fiji Islands | earthquake | 13.4 | 12.2 | 0.147 | 15.0 | reviewed | us | us |
| 2653 | 2021-12-30T10:20:37.091Z | -9.2556 | 161.2634 | 10.0 | 4.6 | mb | NaN | 110.0 | 1.311 | 1.18 | ... | 2022-03-05T23:11:11.040Z | 82 km SE of Auki, Solomon Islands | earthquake | 9.0 | 1.9 | 0.146 | 14.0 | reviewed | us | us |
| 3603 | 2022-01-18T14:20:59.846Z | -5.6212 | 146.2083 | 79.66 | 4.5 | mb | NaN | 63.0 | 3.737 | 0.59 | ... | 2022-04-01T17:40:34.040Z | 64 km SE of Madang, Papua New Guinea | earthquake | 9.6 | 7.6 | 0.097 | 31.0 | reviewed | us | us |


All latitudes are negative! But this is not all the filtering we said we wanted to do.
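A random sample only shows a few rows, so it cannot prove that every row passed the filter. The sketch below shows what the expression `df["latitude"] < 0` actually produces (a column of `True`/`False` values, here given the hypothetical name `mask`) and then checks all remaining rows at once. It uses a three-row stand-in dataframe with values copied from the output above.

```python
import pandas as pd

# Three-row stand-in dataframe (values copied from sample output in this notebook).
df = pd.DataFrame({
    "latitude": [-56.4952, 44.0143, -9.2556],
    "place": [
        "South Sandwich Islands region",
        "off the coast of Oregon",
        "82 km SE of Auki, Solomon Islands",
    ],
})

mask = df["latitude"] < 0      # one True/False value per row
Earthquake_south = df[mask]    # keeps only the rows where mask is True

print(mask.tolist())                             # [True, False, True]
print(len(df), "->", len(Earthquake_south))      # 3 -> 2
print((Earthquake_south["latitude"] < 0).all())  # True: every remaining row is southern
```

Because the comparison is strictly less than 0, an earthquake recorded at a latitude of exactly 0 would not be kept.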

### Filtering with the `.loc` indexer

The next step is to create the final dataframe, `Earthquake_subset`, by filtering `Earthquake_south` with `.loc`, which selects rows and columns by their labels (here, the column names as strings).

First, we identify the dataframe we just made, `Earthquake_south`, and follow it with `.loc`. Inside the square brackets, we first indicate which rows to keep (in this case, all of them) by simply putting a `:`. After a comma, a second pair of brackets contains a list of the variables of interest named above, each in quotes.

- The following command will achieve this step:

In [27]:
Earthquake_subset = Earthquake_south.loc[:,["time","latitude","longitude","depth","mag","place"]]

A random sample using `.sample()` shows that we now have a fully filtered set of data:

In [52]:
Earthquake_subset.sample(4)

| | time | latitude | longitude | depth | mag | place |
|---|---|---|---|---|---|---|
| 16193 | 2022-10-10T11:19:02.274Z | -29.7808 | -71.6477 | 32.78 | 4.8 | 35 km WNW of Coquimbo, Chile |
| 10279 | 2022-06-04T02:59:28.434Z | -58.7733 | -25.5884 | 43.549 | 5.6 | South Sandwich Islands region |
| 11899 | 2022-07-09T15:44:44.473Z | -9.7314 | 112.952 | 41.888 | 4.9 | south of Java, Indonesia |
| 2595 | 2021-12-29T12:29:07.402Z | -19.3741 | 167.8158 | 10.0 | 4.1 | 155 km W of Isangel, Vanuatu |


We only have the variables that we listed as relevant to our research, and all latitude values are still negative.
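The two filtering steps can also be combined, because the rows slot of `.loc` accepts a `True`/`False` condition in place of `:`. This is offered as an optional shortcut, not a replacement for the two-step approach used in this walkthrough. The sketch below uses a three-row stand-in dataframe (values copied from earlier output, with `magType` included as a column to be dropped) and a hypothetical list name, `columns`, and confirms that both routes give the same result.

```python
import pandas as pd

# Three-row stand-in dataframe (values copied from sample output in this notebook).
df = pd.DataFrame({
    "time": [
        "2022-02-09T19:27:57.539Z",
        "2021-12-08T22:21:38.559Z",
        "2021-12-30T10:20:37.091Z",
    ],
    "latitude": [-56.4952, 44.0143, -9.2556],
    "longitude": [-26.7052, -129.242, 161.2634],
    "depth": [107.33, 10.0, 10.0],
    "mag": [4.5, 4.6, 4.6],
    "magType": ["mb", "mb", "mb"],
    "place": [
        "South Sandwich Islands region",
        "off the coast of Oregon",
        "82 km SE of Auki, Solomon Islands",
    ],
})
columns = ["time", "latitude", "longitude", "depth", "mag", "place"]

# Two steps: filter the rows first, then pick the columns.
Earthquake_south = df[df["latitude"] < 0]
two_step = Earthquake_south.loc[:, columns]

# One step: the row condition and the column list go into .loc together.
one_step = df.loc[df["latitude"] < 0, columns]

print(one_step.equals(two_step))  # True: both routes give the same dataframe
```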

## Exporting our subset as `.csv`

Now that we have successfully filtered our original dataset not once, but *twice*, to create a new subset, we can export it as a `.csv` file.

We will use the function `.to_csv()`. All we have to do is indicate the dataframe to export, `Earthquake_subset`, before the function, and what we want to name the file in quotes within the parentheses.

We also specify `index=False` so that the exported file does not include the numbered row index, which would otherwise be written as an extra first column by default.

- The following command executes this step:

In [56]:
Earthquake_subset.to_csv("Earthquake_subset.csv", index=False)
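To confirm that an export worked, one option is to read the new file back in and look at its columns. The sketch below does this with a two-row stand-in for `Earthquake_subset` (values copied from the output above) and writes to a temporary folder so that it does not touch your project directory; the names `tmp_dir`, `path`, and `reloaded` are hypothetical. With your real data you could simply run `pd.read_csv("Earthquake_subset.csv")`.

```python
import os
import tempfile
import pandas as pd

# Two-row stand-in for Earthquake_subset (values copied from output in this notebook).
Earthquake_subset = pd.DataFrame(
    {
        "time": ["2022-06-04T02:59:28.434Z", "2022-07-09T15:44:44.473Z"],
        "latitude": [-58.7733, -9.7314],
        "longitude": [-25.5884, 112.952],
        "depth": [43.549, 41.888],
        "mag": [5.6, 4.9],
        "place": ["South Sandwich Islands region", "south of Java, Indonesia"],
    },
    index=[10279, 11899],  # original row numbers, as in the sample output
)

# Write to a temporary folder, then read the file back in.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "Earthquake_subset.csv")
    Earthquake_subset.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(list(reloaded.columns))   # only the six chosen columns, no extra index column
print(reloaded.index.tolist())  # [0, 1]: rows are renumbered from 0 on reading
```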

You are now done! Enjoy.