## Overview

The goal of this lab is to familiarize yourselves with Python so that we can begin handling, processing, analyzing and visualizing data. This lab will tour you through the basics of Python and will ask that you create a series of simple algorithms as a final deliverable. 



# Setting up

If you aren't already viewing these instructions from within Jupyterlab, then clone the current repository to your computer and open this document from within Jupyterlab.

Reminder: To do this, there are several steps you should consider which we went over in a previous lab.
- cloning this repository
- creating a virtual environment
- activating the virtual environment
- installing Jupyterlab in the virtual environment
- starting Jupyterlab
- once inside Jupyterlab, open this .ipynb notebook

Alternatively, you could skip creating a new virtual environment for this lab session and reuse the one you created for a previous lab. Or you could skip virtual environments altogether and install Jupyterlab globally. It is ultimately up to you to manage your own setup. However, as projects become more complex and involve more and more packages, you will want to get into the habit of creating a new environment for each project.

# Understanding Jupyter notebooks

You are currently reading text that was entered into a Markdown cell. In this class we will be using Code cells (where you can write Python code) and Markdown cells (where you can write text using Markdown styling).

- You can create a new cell by hitting the **+** icon at the top-left of the notebook tab window (or, when a cell is selected but not being edited, by pressing **a** to insert a cell above it or **b** to insert one below it).
- When adding a cell, a Code cell is created by default, but you can convert this cell into a Markdown cell by selecting *Markdown* from the dropdown menu at the top of the notebook tab window while the cell is selected (or you can just press **m** with the cell selected).

Normally, in a standard Python script (.py file), we would use the hash symbol (#) at the start of a line whenever we wanted to add a comment to our code. However, Jupyter notebooks allow us to go a step further by creating Markdown cells, which allow for more readable and elaborate commenting. This is mostly what you will be using to comment on your work in this class. 

- Once you have entered code or text which you are happy with, you can press **Shift+Enter** or **Ctrl/Cmd+Enter** to run the cell (or you can use the play button at the top of the tab window). This will render the Markdown text if it is a Markdown cell, and run the Python code if it is a Code cell.
- To edit a rendered Markdown cell, double-click on it.

Note as well that the directory in which the present notebook (.ipynb file) is located is visible on the left pane. Using this browser, you can open files from within Jupyter: your entire workspace (files and folders inside your current directory) is technically accessible from within Jupyterlab, so if that simplifies things for you, then use it! In fact, you can also open a console in Jupyter:

- Click the big plus-sign (**+**) icon on the far left, located under the *File* dropdown menu, to open a new tab.
- Click on **Terminal**. This will open a new console window which could be useful for several things while Jupyter is running (e.g. installing python packages, running the Python prompt, using git, etc.). Note that if you are using Jupyterlab inside an activated virtual environment, your new console will also be using the Python version and packages from this virtual environment (Yay!).

For more on Jupyter, see the [official documentation](https://jupyterlab.readthedocs.io/en/stable/user/interface.html) and this [video guide](https://www.youtube.com/watch?v=A5YyoCKxEOU&ab_channel=Jupyter%2FIPython). For more on running cells and some of the intricacies involving code execution order, [watch this video](https://youtu.be/oJ6z02N0Te0?t=330) from 00:05:30 onward, which highlights some issues you will definitely face as your notebooks become more elaborate!

---

# Playing with Python

These first exercises introduce the basics of Python using some hands-on examples and exercises.

## Python variables and object types

>Python is dynamically typed, meaning that any variable can be assigned any value. There are no type declarations. A variable that holds an integer can then be assigned a string, for example. Primitive types include integers, floats, strings (both single-byte and Unicode), and booleans (with literals True and False). Built-in container types include lists, dictionaries, and classes. ([source](https://pyhurry.readthedocs.io/en/latest/first.html#types))

In Python as with other languages, we assign values (data) to variables. Here are some variable naming rules and conventions you should follow:
- Use lowercase without spaces.
- Use underscores in place of spaces.
- Name variables descriptively.
- Do not start a variable name with a number.
- Avoid existing function and built-in names.

For more variable naming schemes, especially when wanting to use multiword variables, see this [quick reference](https://curc.readthedocs.io/en/latest/programming/coding-best-practices.html#variable-naming-conventions).
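
To make the quoted point about dynamic typing concrete, here is a minimal sketch. The variable names are made up for illustration and follow the lowercase-with-underscores convention listed above.

```python
# A variable name is just a label: it can be re-assigned to a value of another type.
station_count = 10
print(type(station_count))   # <class 'int'>

station_count = "ten"
print(type(station_count))   # <class 'str'>

# Descriptive, lowercase names with underscores are easy to read later on.
first_station_name = "Aberdeen"
print(first_station_name)
```

No type declaration was needed at any point: Python works out the type from the value currently assigned.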

In [142]:
firstStatement = 'hello class'
secondStatement = 'in the lesson below, run the code cells to see the results (they will only show if they are being printed). Also, feel free to modify the code as you want.'

print(firstStatement)
print(secondStatement)

hello class
in the lesson below, run the code cells to see the results (they will only show if they are being printed). Also, feel free to modify the code as you want.


### Booleans

Booleans are boring: True or False? Well, they can be useful sometimes.

In [None]:
iAmFalse = False
iAmTrue = True

Logical tests will result in a boolean. For example:

In [None]:
iAmFalse == iAmTrue

In [None]:
isThisTrue = iAmFalse == iAmTrue
print(isThisTrue)

📝 What is the difference between the two equal signs (=,==) above? How are they being used differently? Write your answer in the Markdown cell below.

📝 What is the difference between x and y below? What data types are they respectively? Write your answer in the Markdown cell that follows.

In [None]:
x = False
y = "False"

### Numbers

Numbers can come in different forms!

📝 Beyond the fact that they are different values, what is the main difference between the data types for x and y respectively? What are they? Write your answer in the Markdown cell located below the code cell.

In [None]:
x = 99
y = 4.9

### Strings

Strings are text data. Strings can be declared with either '' or "" marks.

In [None]:
'I am a string' == "I am a string"

📝 In the code cell below, declare an empty string with a variable named *emptyString*.

Strings have many built-in methods (all native Python objects do), and in this lab we will explore string methods a little further than those of other types.

For example, you can select a specific character inside a string using the index of its position in the string (indexes start at 0):

In [None]:
x = "Nov 1, 2021"
firstChar = x[0]

print(firstChar)

In [None]:
secondChar = x[1]

print(secondChar)

📝 In the code cell below, print the character 'v' from x. Refer to the examples above for guidance!

You can also slice ranges of string characters.

In [None]:
x = "Oct 21, 2021"

month = x[0:3]
day = x[4:6]
year = x[8:12]

#tip: print several variables at once by separating them with commas (print adds a space between them)
print(day,month,year)
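
A few more slicing forms are worth knowing. The sketch below uses a made-up variable name (`date_text`) and shows that the start index is included, the stop index is excluded, and either bound can be left out.

```python
date_text = "Oct 21, 2021"

# text[start:stop] includes the character at start and excludes the one at stop.
month = date_text[0:3]        # characters at positions 0, 1 and 2

# Leaving out a bound means "from the beginning" or "up to the end".
same_month = date_text[:3]
year = date_text[8:]

# Negative indexes count backwards from the end of the string.
last_char = date_text[-1]
last_four = date_text[-4:]

print(month, same_month, year, last_char, last_four)   # Oct Oct 2021 1 2021
```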

You can use the split method to chunk a string into different elements based on a delimiter.

In [None]:
x = "Dec 24, 2021"

splitDates = x.split(',')
print(splitDates)
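
Splitting on a bare comma keeps whatever whitespace surrounded the delimiter. Here is a small sketch with a made-up string showing two common ways to deal with that: the `strip` method, or a delimiter that includes the space.

```python
fruit_text = "apples, pears, plums"

parts = fruit_text.split(",")
print(parts)                  # ['apples', ' pears', ' plums']

# strip() returns a copy of the string without leading and trailing whitespace.
second_fruit = parts[1].strip()
print(second_fruit)           # pears

# Splitting on ", " (comma plus space) avoids the leftover spaces altogether.
clean_parts = fruit_text.split(", ")
print(clean_parts)            # ['apples', 'pears', 'plums']
```

The second approach only works when every delimiter is followed by exactly one space, so `strip` tends to be the safer choice for messy data.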

📝 What object type is *splitDates*? Use the *type* function in the code cell below to find out.

### Lists

Lists are like arrays (the common term for this kind of thing in other languages), and can contain any number of objects of any type. Lists are a very versatile data type which you will find yourself using often.

In [None]:
#using the [] will create an empty list
x = []
print(type(x))

In the code cell below, there are some examples of lists and some comments that guide you in understanding them. Comments are also used in these labs for guidance, much like this cleaner looking Markdown cell.

In [None]:
#below we declare a rather eclectic list
x = ["a",45,"$ 89",True,"Winter is here",None,"pay attention",[1,2,3,4],99.9]

#As you can see, lists can contain other lists. They can basically contain anything, in any order.
#You can write them out differently as well, for improved legibility.
#Let's declare this list once more, this time by separating each item by a new line.
x = [
    "a"
     ,45
     ,"$ 89"
     ,True
     ,"Winter is here"
     ,None
     ,"pay attention"
     ,[1,2,3,4]
     ,99.9
    ]

print(x)

#Note the item inside x called None. This is not a string with text "None" but the special object None, whose type is NoneType. This is also a data type you will encounter in Python!
nothing = None
type(nothing)

📝 Lists have several built-in methods. Using the hash symbol, add a comment below each operation in the following cell describing what exactly you think the list method is doing.

In [None]:
#We have already declared a list called x above.

x.pop(0)
print(x)

x.append('a')
print(x)



As you might have noticed, these list methods operate in-place on the list from which they're being called. This means you don't need to assign the result to a new variable: it does the work directly on the list you called it from, permanently changing that list.
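
To illustrate the in-place idea with a method not used above, the sketch below contrasts the list method `sort` (which changes the list itself) with the built-in function `sorted` (which returns a new list). The variable names are made up for this example.

```python
numbers = [3, 1, 2]

# sort() reorders the list in place and returns None.
result = numbers.sort()
print(numbers)       # [1, 2, 3]
print(result)        # None

# sorted() leaves the original untouched and hands back a new list.
original = [3, 1, 2]
sorted_copy = sorted(original)
print(original)      # [3, 1, 2]
print(sorted_copy)   # [1, 2, 3]
```

A common slip is writing `numbers = numbers.sort()`, which replaces the list with `None`.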

### Sets

[Sets](https://docs.python.org/3.8/library/stdtypes.html#set-types-set-frozenset) are more rarely used in our context. They are **unordered**, list-like objects whose contained objects must be distinct, meaning there can be no duplicates. Like lists, sets are mutable (you can change what's inside them once they've been created), but they do not offer the same built-in methods as lists do (for example, you cannot access an item by index).

A great trick for removing duplicates from a list is by turning it into a set.

In [None]:
thelist = [1,6,3,6,3,6,3,5,9,9,1]
print("Check out this list: ",thelist)

newset = set(thelist)
print("Here is a set of distinct values: ",newset)
print("We can see that it is a set here: ",type(newset))

newlist = list(newset)
print("Our new list of distinct values: ",newlist)
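
Because sets are unordered, the list you get back from `list(set(...))` is not guaranteed to keep the original order of the items. If order matters, one possible alternative (a sketch, not the only way) relies on dictionaries, which are introduced further down: `dict.fromkeys` keeps the first occurrence of each item in the order it was seen.

```python
the_list = [1, 6, 3, 6, 3, 6, 3, 5, 9, 9, 1]

# Dictionary keys are unique and remember insertion order,
# so this removes duplicates while keeping first-seen order.
distinct_in_order = list(dict.fromkeys(the_list))
print(distinct_in_order)      # [1, 6, 3, 5, 9]

# Sets are still handy for quick membership checks.
distinct_values = set(the_list)
print(6 in distinct_values)   # True
print(7 in distinct_values)   # False
```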

### Tuples

Tuples are like lists, but they are immutable (frozen) once you have created them. Like lists, they can hold objects of any type, and unlike sets, they allow duplicates.

In [None]:
# using the () will create an empty tuple
x = ()
print(type(x))

Tuples are great for creating objects you know you don't want to ever be changed. For example, they are great for holding coordinates! By being restrictive (in this case, using a tuple instead of a list), you narrow the scope of possibility for the variable, limiting what can be done to it and therefore having tighter control over it. In other words, it allows for cleaner and more predictable code.

In [None]:
montrealLoc = (45.49980145207762, -73.57467364738282)
print(montrealLoc)
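
Here is a short sketch of what "immutable" means in practice, using a made-up variable name and rounded coordinates. Trying to change an item raises a `TypeError`, while reading items or unpacking them into separate variables works fine.

```python
montreal_loc = (45.4998, -73.5747)

# Assigning to an item of a tuple is not allowed.
try:
    montreal_loc[0] = 0.0
    was_modified = True
except TypeError as error:
    was_modified = False
    print("Tuples cannot be modified:", error)

# Unpacking assigns each item of the tuple to its own variable.
latitude, longitude = montreal_loc
print(latitude, longitude)    # 45.4998 -73.5747
```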

### Dictionaries

Since Python 3.7, dicts - like lists - are guaranteed to be **ordered** (CPython 3.6 already behaved this way as an implementation detail). This means that the order in which dict elements were created is the order in which they remain, that is, until you choose to re-order them.

Dictionaries are like lists in that they are mutable (changeable), flexible containers that can hold just about anything. The difference with lists though is that your elements can be labeled. This means that you can access items in a dictionary using a label, instead of using an index like you have seen with lists (e.g. list[position]).

In dictionaries, the label is called the *key*, whereas the item it points to is called its *value*.

In [None]:

# using the {} will create an empty dictionary
x = {}
print(type(x))


Let's create a dictionary. Joe is a mapping intern at Amtrak. You've scraped his Facebook profile, for some reason, and you've structured some of the data retrieved as follows:

In [None]:
joesProfile = {"name": "Joe Blo", "age": 26, "hometown": "Schenectady, NY"}

print(joesProfile)

You can access a particular value in a dict by selecting its key:

In [None]:
joesName = joesProfile["name"]
print(joesName)

In [None]:
#to add an item to a dict, simply declare the new key with its value as follows:
joesProfile['religion'] = 'new age'

print(joesProfile)
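
A few other dictionary operations tend to come in handy. The sketch below uses a made-up dictionary (`fruit_stock`) to show how to check whether a key exists, how to read a key that might be missing, and how to get all keys or all values at once.

```python
fruit_stock = {"apples": 3, "pears": 0}

# The in operator checks for a key (not a value).
print("apples" in fruit_stock)    # True
print("plums" in fruit_stock)     # False

# fruit_stock["plums"] would raise a KeyError;
# get() returns a fallback value instead when the key is missing.
plum_count = fruit_stock.get("plums", 0)
print(plum_count)                 # 0

# keys() and values() can be turned into lists.
fruit_names = list(fruit_stock.keys())
fruit_counts = list(fruit_stock.values())
print(fruit_names, fruit_counts)  # ['apples', 'pears'] [3, 0]
```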

📝 In the code cell below, print the following sentence using the three values found in *joesProfile*:

"His name is Joe, he is 26 and he comes from Schenectady, NY"

Hint: Use string methods described above to retrieve "Joe" from "Joe Blo"... You might need to declare some variables before printing anything (break this problem down into simple steps!)

In [None]:
#print(...)

## Control flow

### Conditional statements (if, elif, else)

Conditional statements are very useful. They do a logical test, and if that test is true, allow the flow to enter its codeblock. Otherwise, they will skip whatever operations are nested within them and move on. You can stack conditional statements to see if data matches a series of conditions. Think of it as an investigation, where you ask a bunch of yes-no questions, and choose a certain follow-up question in accordance with the answer you got.

In [None]:
myVariable = 'The world is BIIIIG'

if 'i' in myVariable:
    print('Ok great')
elif myVariable[0:3] == 'The':
    print("'The' is a funny word isn't it")
else:
    print("fuhgettaboutit")
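
The order of the tests matters: Python checks them from top to bottom and runs only the first branch whose test is true. The sketch below shows this with made-up temperature thresholds.

```python
temperature = 12

if temperature < 0:
    label = "freezing"
elif temperature < 15:
    # 12 is not below 0, so the first test fails and this one is checked next.
    label = "cool"
elif temperature < 30:
    # 12 is also below 30, but this branch is skipped
    # because an earlier branch already matched.
    label = "warm"
else:
    label = "hot"

print(label)   # cool
```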



📝 Modify the variable above so that it prints "'The' is a funny word isn't it" instead of 'Ok great'

📝 Modify the variable above so that it prints "fuhgettaboutit"

### For loops

Writing for loops in Python is easy. Each iteration in the loop below declares a string in the list as the variable myLetter and then prints it if its length is equal to 1.

In [None]:
mylist = ['a','b','a','b','d','d','g','h','a','h','sdf','a','fsafd','sdfdsf']

for myLetter in mylist:
    if len(myLetter) == 1:
        print(myLetter)
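
Instead of printing inside the loop, you will often want to collect results in a new list. A minimal sketch of that pattern, with a made-up list of words: start from an empty list and `append` to it whenever an item passes the test.

```python
words = ["apple", "kiwi", "banana", "fig"]

# Start with an empty list, then fill it as the loop goes along.
short_words = []
for word in words:
    if len(word) <= 4:
        short_words.append(word)

print(short_words)   # ['kiwi', 'fig']
```

The original list is left unchanged, and the new list can be printed or reused after the loop has finished.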

📝 In the code cell below, write a for loop that prints only numbers that are less than 10 from the list you are provided with. Note that we did not go over arithmetic operators (+,-,\*,/) or comparison operators (>,<,==) but they are fairly universal... If you need a refresher, go ahead and Google it!

In [None]:
myNumberList = [123,4,76,46,34,4,6,2,0,12,65,4,9,1,0,199,19000]

#start your code here

---

## Working with Amtrak as a junior data specialist

You're at a new job with Amtrak, and you're pretty good at Python. A supervisor has given you a messy inventory of stations for the entire United States. The supervisor isn't data literate and doesn't realize how much of a mess the data is. They have a cartography intern who isn't very savvy and would like a list of cities, without the state name or any other clutter, to be used as labels for their hopefully good-looking PDF maps. In principle this mapmaker could figure this out themselves, but they clearly can't! Further, since interns come and go, the next intern might ask this of you again for another map in a couple of months. You also can't guarantee that stations will always be in the same cities: given a recent boost in federal funding towards passenger rail infrastructure, it's likely that the number of stations will change over time.

You could clean this data in Excel, but that would mean having to do it again, manually, the next time, and again the time after that. Also, you really don't want to be manually keeping track of new station names.

You conclude that you should write a small script that can handle new incoming data and output clean city names for these ineffective mapping interns, and that also cleans up the messy data you regularly receive from your supervisor so you can analyze it more effectively. Basically, you will write a small algorithm that will automate a simple data wrangling task, mostly involving strings.

The list below is a sample of 10 stations from the data (which contained 1089 stations as of Oct 21 2021) that can be downloaded at the [Bureau of Transportation Statistics](https://www.bts.gov/ntad). The data was intentionally modified here to create a difficult but very realistic scenario!

![image.png](attachment:8dd53880-23f4-4540-89bf-fd9557aae130.png)

*A screenshot of an online map displaying Amtrak stations in the Northeast.*([src](https://www.amtrak.com/northeast-train-routes))

📝 In a new cell below, write a Python algorithm that will create a new list containing only the city names and print that list using a print statement. Everything you need has been demonstrated to you previously in this lab. Your code shouldn't be more than a couple of lines.

In [None]:
sampleList = [
    "Aberdeen, MD21001, Station Building (with waiting room)"
    ,"Albuquerque, NM87102, Station Building (with waiting room)"
    ,"Antioch-Pittsburg, CA94509, Platform with Shelter"
    ,"Arcadia, MO63621, Station Building (with waiting room)"
    ,"Ardmore, OK73401, Station Building (with waiting room)"
    ,"Augusta, ME4330, Curbside Bus Stop only (no shelter)"
    ,"Ashland, OR97520, Curbside Bus Stop only (no shelter)"
    ,"Arkville, NY12406, Curbside Bus Stop only (no shelter)"
    ,"Ashland, KY41101, Platform with Shelter"
    ,"Alanson, MI49706, Curbside Bus Stop only (no shelter)"
]

You would also like to create a dictionary to store this data so that you can do analyses with it later. Your dictionary labels should probably be the city name followed by the state abbreviation, since there is a chance that two cities share the same name (though it is much more unlikely that two cities of a same name exist within the same state). Doing this will make sure you don't have any duplicate keys in your dictionary. Each key in this dict should contain all the relevant info that you're able to parse from the messy list provided above.

You imagine your dict as formatted like the following:

In [None]:
mystations = {
    "Baltimore, MD": [ 21201, "Station Building (with waiting room)" ],
    # ... (one entry like the above per station)
}

📝 In the template code cell below and using *sampleList*, create a dict that imitates the structure suggested above. You will need to think through this task in steps. Comments are provided below to guide you. Focus and solve this ONE STEP AT A TIME.

In [None]:
#create your new, empty dict, which you will be populating
mystations = {}

#begin looping through your sampleList to retrieve its data, one line at a time
for station in sampleList:    
    #start by separating and assigning each piece of information to its own variable. You should create 4 variables here per item in sampleList.
    #(if you wish to use the variable declarations below, remove the hash symbols to uncomment)
    #this could take you several steps...
    
    #cityName = 
    #stateAbbr = 
    #zipCode = 
    #stationType = 
    #...

print(mystations)

📝 From your dict, print the value of Aberdeen, MD in the code cell below.

📝 What type of object is the value for Aberdeen, MD? Locate it in your dict and print its type in the code cell below.

📝 Again from your dict, print the zip code of Aberdeen, MD in the code cell below.

The carto intern has no idea how to use dicts. They need a list of the zip codes so they can make a simple heatmap of stations across the US. 

📝 Using your dict, create a simple list of ZIP codes in the code cell below.

# Using .gitignore files

If you are not already aware of what [.gitignore files](https://docs.github.com/en/get-started/getting-started-with-git/ignoring-files) are, then keep reading.

A *.gitignore* allows you to tell git what not to track or commit to your repository. Basically, if you have created some clutter (notes, subdirectories) in your local repository, or your code has generated some unwanted files you prefer not to share to your public repository hosted on Github, then you can tell git to ignore these inside a *.gitignore* file. This file needs to be located at the root directory of your repository (i.e. not inside a subdirectory).

To create a *.gitignore* file, you can do the following (these commands will differ if you are using the Windows Command Prompt):
- `touch .gitignore` to create the file
- `ls` to view your file in the directory

Oh! You can't see it. This is because a dot at the start of a file or folder name hides it from the default `ls` listing. Finder (Mac) hides such items by default as well, whereas Windows Explorer generally shows them unless they are explicitly marked as hidden.

- To view hidden items, enter `ls -a` (for all). You can also add the `-l` argument for a more detailed display (`ls -al`).

Now you should see your new *.gitignore* file, as well as a folder called *.git*. Note that any directory with a *.git* folder inside it is a sign that it's an active git repository. This folder is created automatically and stores your version history, etc.

- To edit your *.gitignore* file, open it in the text editor of your choice (if you are willing to do this directly from within the console, you can type `vi .gitignore` to open the file in Vim). If you want to use another text editor that you cannot access directly through a command, you will need to navigate to your directory and open it from outside the command line.

In your *.gitignore* file, you basically write out each file you would like to ignore on a separate line. If you want to ignore an entire folder, then you just need to write the folder name followed by a forward slash (*/*).

In my case, I would like git to ignore my virtual environment folder, which I happened to have created inside my repo. I would also like it to ignore the *.ipynb_checkpoints* folder that was created automatically by Jupyter.

- If this is true for you as well, then input the following text inside your *.gitignore*. If you are using Vim, then you will need to press *i* before inputting anything.

```
venv*/
.ipynb_checkpoints/
```

In the snippet above, *venv\*/* is used to match with any folder that starts with the word venv, followed by any number of other characters or digits. For more elaborate pattern matching guidance, you can refer to this [documentation](https://git-scm.com/docs/gitignore#_pattern_format). Add any other files or folders you would like to hide from your online repository...
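
If you would like to get a feel for how the `*` wildcard behaves, Python's standard `fnmatch` module applies similar shell-style wildcard rules. The sketch below is only an approximation for simple patterns: it is not git's own matcher and does not cover the trailing slash or other .gitignore rules, and the folder names are made up.

```python
from fnmatch import fnmatchcase

pattern = "venv*"
folder_names = ["venv", "venv-lab2", "my-venv", ".ipynb_checkpoints"]

# Keep the names that the wildcard pattern would match.
matching_names = []
for name in folder_names:
    if fnmatchcase(name, pattern):
        matching_names.append(name)

print(matching_names)   # ['venv', 'venv-lab2']
```

Note that *my-venv* is not matched, because the pattern requires the name to start with "venv".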

- When you are done inputting what you want git to ignore, save and close your text editor (*:wq* in Vim).
- Now, enter `git status`: any untracked items you input into your gitignore file will no longer be visible.

❗ Removing files that are already tracked...

If you have already added and started tracking changes to a file which you would now like git to ignore, see the following instructions, quoted from [Github](https://docs.github.com/en/get-started/getting-started-with-git/ignoring-files):

> If you want to ignore a file that is already checked in, you must untrack the file before you add a rule to ignore it. From your terminal, untrack the file. `$ git rm --cached FILENAME`

## Pushing your .gitignore to your online repo

The last thing to consider is whether or not you would like your gitignore to be public. If you were pushing this code to a repo you expect others to be downloading and using, you might not want them to see what you're trying to ignore on your own machine. In this case, though, the instructor would like to see your gitignore file.

📝 Check in your *.gitignore* file and commit it to your repo so that it gets pushed to Github: `git add .gitignore`, `git commit -m 'add .gitignore'` and then `git push` when you are ready.

# Deliverables

You will need to push the following to your online repository for evaluation:
1. Your modified *lab2.ipynb* containing:
   * Your answers to all the above questions!
2. Your .gitignore file
   * Note that your repo cannot contain a virtual environment folder or *.ipynb_checkpoints* folder. That is, if these two folders happen to be in your repo, you should be ignoring them in your .gitignore and they should not appear on Github. Other ignored elements are optional.
   
Points awarded emphasize completion and effort over the accuracy of your results! 