# Lab 01: Getting Started

In this lab, we will walk through configuring Google Colab so that you can download the starter code for each lab, run and edit code, and then save and submit it.

Colab is a cloud-based solution that allows you to run Python and Jupyter notebooks from your browser, without having to configure anything on your personal computer.

Before continuing with this notebook, please first complete this brief tutorial: [Overview of Colab](https://colab.research.google.com/notebooks/basic_features_overview.ipynb)


<hr>

## Command-line

Welcome back!

Colab creates a [virtual machine](https://en.wikipedia.org/wiki/Virtual_machine), which is like a fresh install of an operating system made just for you.

This means that in addition to running Python code in this notebook, you can also interact with the command-line as if you were using a terminal to navigate a computer. If you have never used the command-line before, it is worth reading through [this tutorial](https://computers.tutsplus.com/tutorials/navigating-the-terminal-a-gentle-introduction--mac-3855).

Notebooks also allow you to run shell commands by preceding the command with a `!`.

For example, the command below prints the current working directory, that is, the folder in which this notebook is running.


In [28]:
!pwd

/content/cmps3160/_labs


This is the absolute path of the directory in which the notebook's commands run.

To list the contents of a folder, you can use the `ls` command. This will list the contents of the current working directory:



In [29]:
!ls

data	Lab01  Lab03  Lab05  Lab07  Lab09  Lab11  old	     _tmp.md
images	Lab02  Lab04  Lab06  Lab08  Lab10  Lab12  README.md
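
The same information is available from Python itself. The sketch below uses only the standard library `os` module; it is an illustration of the idea, not a replacement for the shell commands used in this lab.

```python
import os

# Equivalent of `!pwd`: the absolute path of the current working directory.
cwd = os.getcwd()
print(cwd)

# Equivalent of `!ls`: the names of the entries in that directory,
# sorted here only to make the listing easier to read.
entries = sorted(os.listdir(cwd))
print(entries)
```

Unlike `!ls`, `os.listdir` returns a Python list, so the result can be filtered or looped over in later cells.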


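Colab also provides a `sample_data` directory in `/content`. When `/content` is the working directory, we can list its contents like this (from any other directory, `ls` reports that it cannot find the folder, as in the output below):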

In [30]:
!ls sample_data

ls: cannot access 'sample_data': No such file or directory


If you expand the left sidebar in Colab, you should see a folder icon. Clicking it opens the typical file navigation menu, where you should see the `sample_data` folder as well.

The following command prints information about the operating system that this notebook is running on. The `cat` command prints the contents of a file, so this prints the contents of every file in the `/etc/` folder whose name ends in `release`, which is where Linux distributions store OS information. The output tells us that we are running Ubuntu, a Linux distribution, and which version we are running.

In [31]:
!cat /etc/*release

DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.3 LTS"
PRETTY_NAME="Ubuntu 22.04.3 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.3 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
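
Because these files are plain `KEY=value` lines, they are easy to read from Python as well. The sketch below is a minimal parser; the function name `parse_os_release` is made up for this example, and the sample text is a few lines copied from the output above so that the code runs anywhere.

```python
def parse_os_release(text):
    """Parse KEY=value lines (the format shown above) into a dict."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not KEY=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        info[key] = value.strip('"')  # drop the optional surrounding quotes
    return info

# A few lines copied from the output above. In Colab you could instead
# pass open("/etc/os-release").read().
sample = '''NAME="Ubuntu"
VERSION_ID="22.04"
PRETTY_NAME="Ubuntu 22.04.3 LTS"
ID=ubuntu
'''

info = parse_os_release(sample)
print(info["PRETTY_NAME"])  # Ubuntu 22.04.3 LTS
```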


## Cloning the course repository

For many labs and demos, you will need to access data that is stored in the class GitHub repository, which is here:

https://github.com/nmattei/cmps3160/

If you have never used GitHub before, here is some background. Git is one of the most widely used version control systems today, and it is invaluable when working in a team. GitHub is a web-based service built around Git that provides repository hosting, user management, and related features. There are other similar services, e.g., Bitbucket and GitLab.

Our use of Git and GitHub for the class will be minimal; however, we encourage you to use them for collaboration on your class project, for other classes, or for anything else, because they're great. To learn more about GitHub, see [this tutorial](https://docs.github.com/en/get-started/quickstart/hello-world). Note that you don't need to do that tutorial to complete this notebook.

The main thing we want to do is clone the course files into this Colab virtual machine. To do so, we will issue a `git clone` command. This will copy all the files from the course GitHub repository to our virtual machine:

In [32]:
!git clone https://github.com/nmattei/cmps3160.git

Cloning into 'cmps3160'...
remote: Enumerating objects: 1811, done.
remote: Counting objects: 100% (701/701), done.
remote: Compressing objects: 100% (271/271), done.
remote: Total 1811 (delta 425), reused 618 (delta 377), pack-reused 1110
Receiving objects: 100% (1811/1811), 48.25 MiB | 24.03 MiB/s, done.
Resolving deltas: 100% (1026/1026), done.
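
Running `git clone` a second time in the same directory fails because the target folder already exists. One way to make the cell safe to re-run is to check for the folder first. The helper below is a sketch: `clone_if_missing` is a hypothetical name, not part of the course code, and it assumes `git` is installed (as it is on Colab).

```python
import os
import subprocess

def clone_if_missing(url, dest):
    """Run `git clone url dest` unless `dest` already exists.

    Returns True if a clone was performed, False if it was skipped.
    """
    if os.path.isdir(dest):
        print(f"{dest} already exists; skipping clone")
        return False
    # check=True raises CalledProcessError if git exits with an error.
    subprocess.run(["git", "clone", url, dest], check=True)
    return True

# Possible usage in Colab (left commented out so nothing is downloaded here):
# clone_if_missing("https://github.com/nmattei/cmps3160.git", "cmps3160")
```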


We should now see a folder called `cmps3160`, which contains a copy of the GitHub repository:

In [33]:
!ls

cmps3160  images  Lab02  Lab04	Lab06  Lab08  Lab10  Lab12  README.md
data	  Lab01   Lab03  Lab05	Lab07  Lab09  Lab11  old    _tmp.md


In [34]:
!ls cmps3160

404.html      css     Dockerfile    img        js	 LICENSE    resources.md  tags.html
CHANGELOG.md  _data   Gemfile	    _includes  _labs	 _projects  schedule.md
_config.yml   _demos  Gemfile.lock  index.md   _layouts  README.md  syllabus.md


## Changing Directories

To change the current working directory, we will use the `cd` command. Note that we need to prefix it with `%` rather than `!` so that the directory change persists to later cells.

In [35]:
%cd cmps3160/_labs

/content/cmps3160/_labs/cmps3160/_labs


In [36]:
!pwd

/content/cmps3160/_labs/cmps3160/_labs
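
To see why the prefix matters, the sketch below imitates both behaviours with the standard library. It assumes a Linux-like system such as the Colab virtual machine: `subprocess.run` plays the role of `!cd` (a short-lived child shell), and `os.chdir` plays the role of `%cd` (a change inside the notebook's own process).

```python
import os
import subprocess

start = os.getcwd()

# Like `!cd /`: the change happens in a child shell that exits immediately.
subprocess.run("cd /", shell=True, check=True)
after_subshell = os.getcwd()
print(after_subshell == start)  # True: the notebook process did not move

# Like `%cd /`: the change happens in the notebook's own process.
os.chdir("/")
after_chdir = os.getcwd()
print(after_chdir)

os.chdir(start)  # restore the original working directory
```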


In [37]:
# Now, we can list the contents of the `_labs` folder.
!ls

data	Lab01  Lab03  Lab05  Lab07  Lab09  Lab11  old	     _tmp.md
images	Lab02  Lab04  Lab06  Lab08  Lab10  Lab12  README.md


In [38]:
# change back the working directory to /content
%cd /content
!ls

/content
cmps3160  drive  sample_data


## Non-persistence of Colab Virtual Machines

An important thing to note about Colab is that files you create during the session will not persist once the runtime shuts down. Google creates these temporary virtual environments to host your notebook, but it shuts them down so that the resources can be reallocated to other notebooks. The runtime will shut down automatically if it is not used for a few hours, so be careful about files that are created during the session.

This means that the `cmps3160` folder that we just created will disappear if we restart the session. You can test this by clicking on `Runtime->Disconnect and delete Runtime`. If you do so, you'll notice `cmps3160` is gone:

In [39]:
!ls

cmps3160  drive  sample_data


To get it back, you'll have to issue the `git clone` command above again.


While these newly created files will disappear, note that the code you write in this notebook should be fine. It should be saved in the `Colab Notebooks` folder in the root of your Google Drive. Please be sure to save frequently as you are working on assignments.

## Mounting Your Google Drive

On some occasions, you may want to create data that will persist. For example, when working on your course project, you don't want to re-collect any data you need for your analysis.

One way to make this data persist is to write directly to your Google Drive, rather than to this virtual machine. To do so, we can use a Python command to "mount" the Google Drive. This will pop up a screen asking you to give this Colab notebook access to your Google Drive. **If you have multiple Google accounts, please be sure to use the same one consistently throughout the course.**

In [13]:
# Mount our personal Google Drive. This will pop up a
# confirmation screen giving this notebook access to your Google Drive.
# You will first need a Google account for this to work.
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive
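
The `google.colab` module only exists inside Colab, so the cell above fails if the notebook is run anywhere else. The sketch below guards against that; `mount_drive` is a hypothetical helper name, and the only Colab call it makes is the same `drive.mount` used above.

```python
def mount_drive(mount_point="/content/drive"):
    """Mount Google Drive when running in Colab; return False elsewhere."""
    try:
        from google.colab import drive  # only importable inside Colab
    except ImportError:
        print("Not running in Colab; skipping Drive mount")
        return False
    drive.mount(mount_point)
    return True

mounted = mount_drive()
```

Later cells can then check `mounted` before writing to `/content/drive`.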


You should now see the contents of your Google Drive by navigating to the folder icon in the left panel. It is viewable at `/content/drive/MyDrive`.

You can also use the `ls` command to list the contents of your Google Drive:

In [14]:
!ls /content/drive/MyDrive

 Admin				    S7
 ADPI				    S8
 APO				   'SOCI 2180 - Final Assignment Rough Draft v2.5.gdoc'
'basc0018 bard v1s.gdoc'	   'Stockholm, Sweden.gmap'
'BASC0018 GPT Draft.gdoc'	    test_file.txt
'BASC0028 GPT RD1.gdoc'		   'Untitled document (1).gdoc'
'BASC0028 RD1.gdoc'		   'Untitled document (2).gdoc'
'basc0047 bag 1.gdraw'		   'Untitled document (3).gdoc'
'basc0047 bag 2.gdraw'		   'Untitled document (4).gdoc'
'BASC0047 Semantic Diagram.gdraw'  'Untitled document (5).gdoc'
'Colab Notebooks'		   'Untitled document (6).gdoc'
 FXXD5-BASC0018-2.gdoc		   'Untitled document (7).gdoc'
 FXXD5-BASC0028-1.gdoc		   'Untitled document (8).gdoc'
 FXXD5-BASC0047-1.gdoc		   'Untitled document (9).gdoc'
 RD3.gdoc			   'Untitled document.gdoc'
 RD4.gdoc			   'Untitled drawing.gdraw'
 Rowing				   'Untitled presentation.gslides'
 S1				   'Untitled spreadsheet (1).gsheet'
 S2				   'Untitled spreadsheet (2).gsheet'
 S2.5				   'Untitled spreadsheet (3).gsheet'
 S3				   'Untitled spreadsheet (4).gshee

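To change the current working directory to the location of your Google Drive, we will issue a `cd` command preceded by the `%` symbol. The difference between `!` and `%` is that `!` runs the command in a temporary shell, so a directory change is lost as soon as the command finishes, whereas `%cd` changes the working directory of the notebook itself, so the change persists in later cells.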

In [15]:
%cd /content/drive/My Drive

/content/drive/My Drive


In [16]:
!pwd

/content/drive/My Drive


To test that this works, we will create a text file and write a sentence to it.

In [17]:
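# Write a sentence to a new text file in the current working directory.
# The `with` block closes the file automatically.
with open('test_file.txt', 'wt') as outf:
    outf.write('Hello from Colab!')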

We can see that this file exists by printing its contents here:

In [18]:
!cat test_file.txt

Hello from Colab!
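
The write-then-read check above can also be done entirely in Python with `pathlib`. In this sketch a temporary folder stands in for the Drive folder so that the code runs anywhere; in Colab you would use the mounted Drive path instead.

```python
import tempfile
from pathlib import Path

# A temporary folder stands in for the Google Drive folder here.
with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "test_file.txt"
    path.write_text("Hello from Colab!")  # create (or overwrite) the file
    contents = path.read_text()           # read it back, like `!cat`
    print(contents)
```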

You should also see this file when you navigate to your Google Drive separately in your browser.

This file will remain here even if you shut down the runtime of this notebook.

## Data for Labs

The data used in the labs is in the `_labs/data` folder of the course repository.

Since we already cloned the repository, let's navigate to the `_labs` folder:

In [19]:
%cd /content/cmps3160/_labs/
!ls

/content/cmps3160/_labs
data	Lab01  Lab03  Lab05  Lab07  Lab09  Lab11  old	     _tmp.md
images	Lab02  Lab04  Lab06  Lab08  Lab10  Lab12  README.md


The `data` folder contains all the data used for the labs.

In [20]:
!ls data

ames.tsv  movielens.zip  names.zip  reds.csv  tips.csv	titanic.csv  whites.csv


The next lab will work with the file `titanic.csv`, which contains information on passengers of the ill-fated Titanic passenger ship. You can use the command `head` to see the first ten lines of this file.

In [21]:
!head data/titanic.csv

pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
1,1,"Allen, Miss. Elisabeth Walton",female,29,0,0,24160,211.3375,B5,S,2,,"St Louis, MO"
1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.5500,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON"
1,0,"Allison, Miss. Helen Loraine",female,2,1,2,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
1,0,"Allison, Mr. Hudson Joshua Creighton",male,30,1,2,113781,151.5500,C22 C26,S,,135,"Montreal, PQ / Chesterville, ON"
1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25,1,2,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
1,1,"Anderson, Mr. Harry",male,48,0,0,19952,26.5500,E12,S,3,,"New York, NY"
1,1,"Andrews, Miss. Kornelia Theodosia",female,63,1,0,13502,77.9583,D7,S,10,,"Hudson, NY"
1,0,"Andrews, Mr. Thomas Jr",male,39,0,0,112050,0.0000,A36,S,,,"Belfast, NI"
1,1,"Appleton, Mrs. Edward Dale (Charlotte Lamson)",female,53,2,0,11769,51.4792,C101,S,D,,"Bay

We can see that this is a comma-separated file, where each row contains information on a ship passenger.
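
Note that some fields, such as the passenger name, contain commas themselves and are therefore wrapped in double quotes. Splitting each line on commas would break those fields apart; a CSV parser handles them correctly. The sketch below uses Python's standard `csv` module on the first two lines of the output above.

```python
import csv
import io

# First two lines of the `head` output above.
sample = (
    'pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest\n'
    '1,1,"Allen, Miss. Elisabeth Walton",female,29,0,0,24160,211.3375,B5,S,2,,"St Louis, MO"\n'
)

header, row = list(csv.reader(io.StringIO(sample)))
record = dict(zip(header, row))  # map each column name to its value

print(len(header), len(row))  # both rows have the same number of fields
print(record["name"])         # the quoted comma stays inside one field
```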

## Pandas

Pandas is a Python library that we will be using extensively to store and analyze data. You can find a brief overview of Pandas [here](https://pandas.pydata.org/docs/user_guide/10min.html).

The key data structure in Pandas is the [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html), which is conceptually similar to an Excel spreadsheet.

Below, we import the `pandas` library and read `titanic.csv` into a new DataFrame object called `df`.

Note that the path we enter here will depend on the current working directory. Here, we assume we are already in the `/content/cmps3160/_labs` folder, so we use the **relative path** `data/titanic.csv` in order to read in the file. This path is relative to the current working directory.

We can print the first five rows of the DataFrame using the `.head()` method.

In [22]:
import pandas as pd
df = pd.read_csv('data/titanic.csv')
df.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
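
The point about relative paths can be checked directly. In the sketch below, a temporary folder stands in for `/content/cmps3160/_labs`; the file does not need to exist for the path to be resolved.

```python
import os
import tempfile
from pathlib import Path

relative = Path("data/titanic.csv")
start = os.getcwd()

# A temporary folder stands in for /content/cmps3160/_labs.
with tempfile.TemporaryDirectory() as folder:
    os.chdir(folder)
    # A relative path is interpreted against the current working directory.
    resolved = relative.resolve()
    os.chdir(start)  # restore the original working directory

print(resolved)
```

If `pd.read_csv('data/titanic.csv')` raises a `FileNotFoundError`, printing the resolved path like this is a quick way to see where Python is actually looking.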


In [23]:
# How many rows are there?
len(df)

1309

In the next labs, we will work with DataFrames in more detail. For now, please complete the short exercises below and submit your notebook to Canvas.

## Exercises

**1. Change the current working directory to the `cmps3160/_demos/data/` folder.**

In [40]:
%cd cmps3160/_demos/data/

/content/cmps3160/_demos/data


Expected output:

```
/content/cmps3160/_demos/data
```

**2. List the contents of this directory.**

In [41]:
! ls

adult.csv      bodyfat.csv  iris.csv  nba_salaries.csv	religon.csv	     titanic.csv
billboard.csv  boundry.png  iris.png  nba_stats.csv	review_polarity.zip


Expected output:

```
adult.csv      boundry.png  nba_salaries.csv  review_polarity.zip
billboard.csv  iris.csv     nba_stats.csv     titanic.csv
bodyfat.csv    iris.png     religon.csv
```

**3. Oh look, there's another `titanic.csv` file in here. Read it into a new DataFrame called `df2` and print the first row.**

In [44]:
df2 = pd.read_csv('titanic.csv')
# head(1) returns only the first row, as the exercise asks.
df2.head(1)

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"


**4. DataFrame objects have a `.describe()` method that summarizes the data. Run it below:**

In [45]:
df2.describe()

Unnamed: 0,pclass,survived,age,sibsp,parch,fare,body
count,1309.0,1309.0,1046.0,1309.0,1309.0,1308.0,121.0
mean,2.294882,0.381971,29.881135,0.498854,0.385027,33.295479,160.809917
std,0.837836,0.486055,14.4135,1.041658,0.86556,51.758668,97.696922
min,1.0,0.0,0.1667,0.0,0.0,0.0,1.0
25%,2.0,0.0,21.0,0.0,0.0,7.8958,72.0
50%,3.0,0.0,28.0,0.0,0.0,14.4542,155.0
75%,3.0,1.0,39.0,1.0,0.0,31.275,256.0
max,3.0,1.0,80.0,8.0,9.0,512.3292,328.0


Expected output:


|       |      pclass |    survived |       age |       sibsp |       parch |      fare |     body |
|:------|------------:|------------:|----------:|------------:|------------:|----------:|---------:|
| count | 1309        | 1309        | 1046      | 1309        | 1309        | 1308      | 121      |
| mean  |    2.29488  |    0.381971 |   29.8811 |    0.498854 |    0.385027 |   33.2955 | 160.81   |
| std   |    0.837836 |    0.486055 |   14.4135 |    1.04166  |    0.86556  |   51.7587 |  97.6969 |
| min   |    1        |    0        |    0.1667 |    0        |    0        |    0      |   1      |
| 25%   |    2        |    0        |   21      |    0        |    0        |    7.8958 |  72      |
| 50%   |    3        |    0        |   28      |    0        |    0        |   14.4542 | 155      |
| 75%   |    3        |    1        |   39      |    1        |    0        |   31.275  | 256      |
| max   |    3        |    1        |   80      |    8        |    9        |  512.329  | 328      |


5. By looking at the output and reading the [method documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html), answer the following questions:

**5a. What fraction of passengers survived (rounded to two decimal places)?**

0.38 (the mean of `survived` is 0.381971, so about 38% of passengers survived)

**5b. What was the median fare (rounded to the nearest dollar)?**

$14 (the `50%` row gives a median fare of 14.4542)

**5c. Were there more 1st class passengers or 3rd class passengers?**

There were more 3rd class passengers: the median `pclass` is 3, so at least half of all passengers travelled in 3rd class.
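
The reasoning behind 5a and 5b can be illustrated on a tiny sample: the mean of a 0/1 column is the fraction of ones, and the `50%` row is the median. The sketch below uses the standard `statistics` module on the `survived` and `fare` values of the five rows printed by `df.head()` earlier, so its numbers describe those five rows only, not the full dataset.

```python
import statistics

# `survived` and `fare` values from the five rows shown by df.head() earlier.
survived = [1, 1, 0, 0, 0]
fare = [211.3375, 151.55, 151.55, 151.55, 151.55]

# The mean of a 0/1 column equals the fraction of rows that are 1.
fraction_survived = statistics.mean(survived)

# The median is the middle value, which describe() reports in its 50% row.
median_fare = statistics.median(fare)

print(round(fraction_survived, 2))
print(median_fare)
```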

**To submit:**

1. File->Download .ipynb
2. Upload the .ipynb file to the appropriate assignment in [Canvas](https://tulane.instructure.com/)