# Google Colab Basics

Google Colab is a browser-based platform that lets you run Python code in your web browser. It is very similar to Jupyter, which you can run locally on your computer or in the cloud. A notebook behaves much as if you had opened a new Terminal ([Windows](https://docs.microsoft.com/en-us/windows/terminal/get-started), [Mac](https://support.apple.com/guide/terminal/welcome/mac), [Linux](https://ubuntu.com/tutorials/command-line-for-beginners#1-overview)) window on your computer and started Python there.

The fundamental units of a Colab notebook are _cells_ and their outputs. A _cell_ is just a code block that contains some Python, which behaves exactly like regular Python code. A cell's outputs are stored within the notebook, much as variables are kept in memory during a terminal session. If you have ever run `python` in Terminal, or run `ipython` to get a more interactive, user-friendly Python, you were doing basically the same thing as a notebook does.

The main thing that distinguishes a notebook from a Python (`.py`) file is that a notebook can also contain Markdown, JavaScript, and HTML. In this course, we will use "text cells", written in Markdown, to answer questions _without_ code, and code cells, written in Python, to answer questions that _require_ code.

In [1]:
# This is a code cell. In it, you can write any valid Python.
# You do not need to double click to see the contents of this cell.
# Remember, the # means that this is a "comment" so in this cell nothing will be output.

This is a text cell. In it, you can write any valid Markdown. Double click on this cell to see the contents of the markdown.

In [2]:
#Google Colab is a browser-based platform for running Python code. The interface is very similar to Jupyter.

#To execute a code cell, press the 'play' button in the upper left corner of the cell, or hit Ctrl + Enter
print('Congratulations! You\'ve run your first code snippet in Colab!')

Congratulations! You've run your first code snippet in Colab!


In [3]:
print('Congratulations! You\'ve run your first code snippet in Colab!'

SyntaxError: ignored
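The cell above fails because the closing parenthesis of `print(` is missing. As a minimal sketch (the helper `has_syntax_error` and the variable names are illustrative, not part of the course materials), Python's built-in `compile` can check whether a piece of code parses without stopping the notebook:

```python
# The same print call as above, stored as text, with its closing parenthesis missing.
broken_source = "print('Congratulations! You\\'ve run your first code snippet in Colab!'"
fixed_source = broken_source + ")"

def has_syntax_error(source):
    """Return True if Python cannot parse `source`."""
    try:
        compile(source, "<cell>", "exec")
    except SyntaxError:
        return True
    return False

print(has_syntax_error(broken_source))
print(has_syntax_error(fixed_source))
```

Adding the missing `)` is all it takes to make the cell valid again.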

## Shell commands

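Colab also has a bash/shell environment. Prefix a line with an exclamation point (`!`) to run it as a bash (terminal-style) command, for example to look through folders. To change directories for the rest of the notebook, use `%cd` rather than `!cd`, as the cells further down do.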

If you are unfamiliar with shell commands, here is a brief overview:

* [`ls`](https://linuxize.com/post/how-to-list-files-in-linux-using-the-ls-command/) shows you the contents of a folder
* [`cd`](https://linuxize.com/post/basic-linux-commands/#changing-directory-cd-command) lets you <u>**c**</u>hange <u>**d**</u>irectories into any folder on the computer.
* `less` followed by a filename (e.g., `less example.txt`) shows you the contents of a file on the computer.
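
The same operations are also available from plain Python, which can be handy inside a code cell. Here is a minimal sketch using only the standard library. The temporary folder and the one-line `example.txt` are stand-ins so the snippet does not depend on Colab's file layout:

```python
import os
import tempfile
from pathlib import Path

# Work in a throwaway folder so nothing on disk is changed permanently.
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "example.txt").write_text("hello\n", encoding="utf-8")

    listing = sorted(os.listdir(folder))   # like `ls`
    print(listing)

    start = os.getcwd()
    os.chdir(folder)                        # like `cd`
    contents = Path("example.txt").read_text(encoding="utf-8")  # like `less example.txt`
    print(contents)
    os.chdir(start)                         # go back before the folder is deleted
```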

In [4]:
!ls

sample_data


In [5]:
!ls ..

bin	 datalab  home	 lib64	opt	    root  srv		     tmp    var
boot	 dev	  lib	 media	proc	    run   sys		     tools
content  etc	  lib32  mnt	python-apt  sbin  tensorflow-1.15.2  usr


In [7]:
!ls sample_data

anscombe.json		      mnist_test.csv
california_housing_test.csv   mnist_train_small.csv
california_housing_train.csv  README.md


In [9]:
!less sample_data/california_housing_train.csv

"longitude","latitude","housing_median_age","total_rooms","total_bedrooms","population","households","median_income","median_house_value"
-114.310000,34.190000,15.000000,5612.000000,1283.000000,1015.000000,472.000000,1.493600,66900.000000
-114.470000,34.400000,19.000000,7650.000000,1901.000000,1129.000000,463.000000,1.820000,80100.000000
-114.560000,33.690000,17.000000,720.000000,174.000000,333.000000,117.000000,1.650900,85700.000000
-114.570000,33.640000,14.000000,1501.000000,337.000000,515.000000,226.000000,3.191700,73400.000000
-114.570000,33.570000,20.000000,1454.000000,326.000000,624.000000,262.000000,1.925000,65500.000000
-114.580000,33.630000,29.000000,1387.000000,236.000000,671.000000,239.000000,3.343800,74000.000000
-114.580000,33.610000,25.000000,2907.000000,680.000000,1841.000000,633.000000,2.676800,82400.000000
-114.590000,34.830000,41.000000,812.000000,168.000000,375.000000,158.000000,1.708300,48500.000000
-114.590000,33.610000,34

In [10]:
!pip install nltk



In [11]:
!pip install datasets

Collecting datasets
  Downloading datasets-1.11.0-py3-none-any.whl (264 kB)
     |████████████████████████████████| 264 kB 5.4 MB/s
Collecting fsspec>=2021.05.0
  Downloading fsspec-2021.8.1-py3-none-any.whl (119 kB)
     |████████████████████████████████| 119 kB 47.3 MB/s
Collecting huggingface-hub<0.1.0
  Downloading huggingface_hub-0.0.16-py3-none-any.whl (50 kB)
     |████████████████████████████████| 50 kB 6.5 MB/s
Collecting xxhash
  Downloading xxhash-2.0.2-cp37-cp37m-manylinux2010_x86_64.whl (243 kB)
     |████████████████████████████████| 243 kB 41.5 MB/s
Installing collected packages: xxhash, huggingface-hub, fsspec, datasets
Successfully installed datasets-1.11.0 fsspec-2021.8.1 huggingface-hub-0.0.16 xxhash-2.0.2


Now, make a file called `example.txt` that contains a short sentence, such as, "The football team of Buffalo, NY is the Buffalo Bills."

You can create this file in any text editor (TextEdit on Macs, Notepad on Windows, or whichever editor you prefer on Linux).
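
If you would rather not leave the notebook, the file can also be written from a code cell. This is only a sketch of one alternative to the text-editor route. It writes `example.txt` into the current working directory:

```python
from pathlib import Path

sentence = "The football team of Buffalo, NY is the Buffalo Bills."

# Write the sentence to example.txt, then read it back to confirm.
example_path = Path("example.txt")
example_path.write_text(sentence + "\n", encoding="utf-8")
read_back = example_path.read_text(encoding="utf-8")
print(read_back)
```

A file created this way already lives in the Colab runtime, so it would not need to be uploaded from your local machine.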

In [12]:
#To upload a file from your local machine:
from google.colab import files
uploaded = files.upload()

Saving abstracts.tsv to abstracts.tsv


In [22]:
#once a file is uploaded to colab, you can read it into python:
uploaded['abstracts.tsv'].decode('utf-8').split("\n")[0:5]

['Offensive language detection (OLD) has received increasing attention due to its societal impact. Recent work shows that bidirectional transformer based methods obtain impressive performance on OLD. However, such methods usually rely on large-scale well-labeled OLD datasets for model training. To address the issue of data/label scarcity in OLD, in this paper, we propose a simple yet effective domain adaptation approach to train bidirectional transformers. Our approach introduces domain adaptation (DA) training procedures to ALBERT, such that it can effectively exploit auxiliary data from source domains to improve the OLD performance in a target domain. Experimental results on benchmark datasets show that our approach, ALBERT (DA), obtains the state-of-the-art performance in most cases. Particularly, our approach significantly benefits underrepresented and under-performing classes, with a significant improvement over ALBERT.',
 'Hate speech and profanity detection suffer from data spar
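
`files.upload()` returns a dictionary that maps each uploaded file name to its contents as raw bytes, which is why the cell above calls `.decode('utf-8')` before splitting. The sketch below imitates that with a hand-made dictionary. `fake_uploaded` and its two lines of text are invented stand-ins, since `google.colab` is only available inside Colab:

```python
# Stand-in for the dictionary returned by files.upload(): file name -> bytes.
fake_uploaded = {"abstracts.tsv": b"First abstract.\nSecond abstract.\n"}

raw_bytes = fake_uploaded["abstracts.tsv"]
text = raw_bytes.decode("utf-8")   # bytes -> str
lines = text.split("\n")           # one list item per line

print(type(raw_bytes).__name__, type(text).__name__)
print(lines[0:5])
```

Note that `split("\n")` leaves a trailing empty string when the text ends with a newline. `text.splitlines()` avoids that.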

In [23]:
#To connect your google drive to colab:
from google.colab import drive

#note: colab will prompt you for an authorization code. Click the link that pops up and paste the code in the box below.
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [24]:
#now you can access the files in your google drive from colab
%cd drive/My Drive
!ls

/content/drive/My Drive
Teaching


In [28]:
!ls Teaching/Fall2021/Computational\ Linguistics/Lectures

intro_to_colab.ipynb  intro_to_python.ipynb


In [29]:
%cd Teaching/Fall2021/Computational\ Linguistics/Lectures/supplementary_files

/content/drive/My Drive/Teaching/Fall2021/Computational Linguistics/Lectures/supplementary_files
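
The backslashes in the commands above escape the space in `Computational Linguistics` for the shell. Inside Python strings no escaping is needed. Here is a small sketch with `pathlib`, using the folder names shown above purely as an example (they will not exist in your own Drive):

```python
from pathlib import Path

# Build the path piece by piece; spaces are ordinary characters in Python strings.
lectures = (
    Path("/content/drive/My Drive")
    / "Teaching"
    / "Fall2021"
    / "Computational Linguistics"
    / "Lectures"
)
supplementary = lectures / "supplementary_files"

print(supplementary.as_posix())
print(supplementary.exists())  # False unless this exact folder exists on your machine
```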


In [31]:
#once a file is in the current folder, you can open it from python.
#note: printing the file object shows its name, mode, and encoding, not the file's contents
with open('abstracts.tsv','r') as example_file:
  print(example_file)

<_io.TextIOWrapper name='abstracts.tsv' mode='r' encoding='UTF-8'>
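
Printing the file object, as above, shows a description of the open file rather than its text. To see the text, you have to read from the object. Here is a minimal sketch, using a small throwaway file (`demo.tsv` and its contents are made up for illustration) in place of `abstracts.tsv`:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # Create a small throwaway file to read from.
    demo_path = os.path.join(folder, "demo.tsv")
    with open(demo_path, "w", encoding="utf-8") as demo_file:
        demo_file.write("first line\nsecond line\n")

    with open(demo_path, "r", encoding="utf-8") as demo_file:
        whole_text = demo_file.read()       # the entire file as one string

    with open(demo_path, "r", encoding="utf-8") as demo_file:
        line_list = demo_file.readlines()   # a list with one string per line

print(repr(whole_text))
print(line_list)
```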


In [32]:
with open('abstracts.tsv', 'r') as abstracts_file:
  abstracts = abstracts_file.readlines()

In [33]:
abstracts[0:5]

['Offensive language detection (OLD) has received increasing attention due to its societal impact. Recent work shows that bidirectional transformer based methods obtain impressive performance on OLD. However, such methods usually rely on large-scale well-labeled OLD datasets for model training. To address the issue of data/label scarcity in OLD, in this paper, we propose a simple yet effective domain adaptation approach to train bidirectional transformers. Our approach introduces domain adaptation (DA) training procedures to ALBERT, such that it can effectively exploit auxiliary data from source domains to improve the OLD performance in a target domain. Experimental results on benchmark datasets show that our approach, ALBERT (DA), obtains the state-of-the-art performance in most cases. Particularly, our approach significantly benefits underrepresented and under-performing classes, with a significant improvement over ALBERT.\n',
 'Hate speech and profanity detection suffer from data sp

### Package installation

Additional packages can be installed with `pip`. For example, NLTK: https://pypi.org/project/nltk/

In [None]:
#you can use pip to install packages
!pip install nltk

In [None]:
#now you can import the package (note: you will need to reinstall packages every time you start a new runtime)
import nltk

```
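import nltk
from nltk.tokenize import word_tokenize

# word_tokenize needs tokenizer data: 'punkt' on older NLTK releases, 'punkt_tab' on newer ones
nltk.download('punkt')
print(word_tokenize("This is a sentence"))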
```

## All of my numbers are messed up. How do I start over?

You will need to restart the runtime. To get a completely clean slate, go to `Runtime > Factory reset runtime`. This will reset all of your cells to having never been run and will uninstall any new packages that you installed.

If you just want to clear your variables and restart the cell numbering while keeping the packages you installed, do `Runtime > Restart runtime` instead.

If you want to erase all of the output of those previously run cells, you should also go to `Edit > Clear all outputs`.

## Many standard Python packages are automatically installed when you start a Colab instance.

```
import numpy as np
import pandas as pd
```

Everything you run in an earlier cell affects the cells you run after it: every variable you define is "known" to all cells run afterwards. However, you should try to keep cells self-contained. For example, our homework assignments will have separate questions, each of which should be answered within a single cell.
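
As a sketch of what this means in practice, imagine the two commented sections below are two separate code cells run in order (the variable names are illustrative):

```python
# --- Cell 1 ---
greeting = "hello"

# --- Cell 2 ---
# `greeting` is still defined here because Cell 1 ran earlier in the same runtime.
shout = greeting.upper()
print(shout)
```

If Cell 2 is run before Cell 1 in a fresh runtime, Python raises a `NameError`, which is one reason to keep cells self-contained.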

# About Markdown

Markdown cells allow you to write simple, HTML-like text without a need for HTML tags. It can help make your notebook look nicer. Here, a Markdown (Text) cell in Colab lets us write _exposition_ about our decisions, _motivation_ for a particular analysis, or _summaries_ of the results of previous analyses. In some variants of Markdown, you can include fancy formatting, including LaTeX.

To make pleasant-looking headings, use `###` before them. `##` will be a similar heading, but slightly larger, and `#` will be the largest heading.

### A cute heading

## Another cute one

# A bigge boye heading

If you want to write something in italics, you can use one asterisk \* before and after what you want to italicize. For example: *this is a sentence*. If you use two \*\*, you will make boldface. For example: **this is also a sentence**. If you want to strike something out, you can use \~\~ before and after, à la: ~~Pretend you didn't see this sentence~~

You can also embed code that is displayed but not run into your Markdown cells by putting three backticks before and after it: \`\`\`

Double click on this cell to see how this looks!

```
list_of_words = ['This', 'is', 'a', 'sentence']
for word in list_of_words:
    print(word)
```

## $\LaTeX$ in Markdown

### Bayes' rule
$p(A|B) = \frac{p(B|A) * p(A)}{p(B)}$

Using $\LaTeX$ in Markdown may come in handy when you want to explain a formula you are trying to turn into code in later assignments.
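
As an example of going from formula to code, Bayes' rule above could be written as a small function. This is a sketch only. The function name `bayes_rule` and the probabilities are made up for illustration:

```python
def bayes_rule(p_b_given_a, p_a, p_b):
    """Return p(A|B) computed from p(B|A), p(A), and p(B)."""
    if p_b == 0:
        raise ValueError("p(B) must be greater than zero")
    return p_b_given_a * p_a / p_b

# Made-up numbers purely for illustration.
posterior = bayes_rule(p_b_given_a=0.9, p_a=0.2, p_b=0.3)
print(round(posterior, 3))
```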

In [34]:
import pandas as pd

In [39]:
!ls /content/sample_data

anscombe.json		      mnist_test.csv
california_housing_test.csv   mnist_train_small.csv
california_housing_train.csv  README.md


In [40]:
df = pd.read_csv("/content/sample_data/california_housing_train.csv", sep=",")

In [41]:
df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.925,65500.0
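
`pd.read_csv` uses the header row as column names and converts each value to a number. To make that step less mysterious, here is a rough standard-library sketch that does something similar for the first two rows of the file. It illustrates the idea and is not how pandas is implemented:

```python
import csv
import io

# The header and first two data rows of california_housing_train.csv.
sample = io.StringIO(
    '"longitude","latitude","housing_median_age","total_rooms","total_bedrooms",'
    '"population","households","median_income","median_house_value"\n'
    "-114.310000,34.190000,15.000000,5612.000000,1283.000000,1015.000000,472.000000,1.493600,66900.000000\n"
    "-114.470000,34.400000,19.000000,7650.000000,1901.000000,1129.000000,463.000000,1.820000,80100.000000\n"
)

# DictReader uses the header row as keys; values arrive as strings, so convert them.
rows = [
    {name: float(value) for name, value in row.items()}
    for row in csv.DictReader(sample)
]

print(rows[0]["median_income"])
print(len(rows))
```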


In [43]:
[x.split() for x in abstracts[0:5]]

[['Offensive',
  'language',
  'detection',
  '(OLD)',
  'has',
  'received',
  'increasing',
  'attention',
  'due',
  'to',
  'its',
  'societal',
  'impact.',
  'Recent',
  'work',
  'shows',
  'that',
  'bidirectional',
  'transformer',
  'based',
  'methods',
  'obtain',
  'impressive',
  'performance',
  'on',
  'OLD.',
  'However,',
  'such',
  'methods',
  'usually',
  'rely',
  'on',
  'large-scale',
  'well-labeled',
  'OLD',
  'datasets',
  'for',
  'model',
  'training.',
  'To',
  'address',
  'the',
  'issue',
  'of',
  'data/label',
  'scarcity',
  'in',
  'OLD,',
  'in',
  'this',
  'paper,',
  'we',
  'propose',
  'a',
  'simple',
  'yet',
  'effective',
  'domain',
  'adaptation',
  'approach',
  'to',
  'train',
  'bidirectional',
  'transformers.',
  'Our',
  'approach',
  'introduces',
  'domain',
  'adaptation',
  '(DA)',
  'training',
  'procedures',
  'to',
  'ALBERT,',
  'such',
  'that',
  'it',
  'can',
  'effectively',
  'exploit',
  'auxiliary',
  'data',
 

In [45]:
my_list = []
for x in abstracts[0:5]:
  my_list.append(x.split())
my_list

[['Offensive',
  'language',
  'detection',
  '(OLD)',
  'has',
  'received',
  'increasing',
  'attention',
  'due',
  'to',
  'its',
  'societal',
  'impact.',
  'Recent',
  'work',
  'shows',
  'that',
  'bidirectional',
  'transformer',
  'based',
  'methods',
  'obtain',
  'impressive',
  'performance',
  'on',
  'OLD.',
  'However,',
  'such',
  'methods',
  'usually',
  'rely',
  'on',
  'large-scale',
  'well-labeled',
  'OLD',
  'datasets',
  'for',
  'model',
  'training.',
  'To',
  'address',
  'the',
  'issue',
  'of',
  'data/label',
  'scarcity',
  'in',
  'OLD,',
  'in',
  'this',
  'paper,',
  'we',
  'propose',
  'a',
  'simple',
  'yet',
  'effective',
  'domain',
  'adaptation',
  'approach',
  'to',
  'train',
  'bidirectional',
  'transformers.',
  'Our',
  'approach',
  'introduces',
  'domain',
  'adaptation',
  '(DA)',
  'training',
  'procedures',
  'to',
  'ALBERT,',
  'such',
  'that',
  'it',
  'can',
  'effectively',
  'exploit',
  'auxiliary',
  'data',
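
The list comprehension and the `for` loop above build the same nested list. Here is a quick check on two short made-up sentences, used as stand-ins for the abstracts:

```python
sentences = ["This is a sentence", "Here is another one"]

# List comprehension version.
with_comprehension = [s.split() for s in sentences]

# Equivalent for-loop version.
with_loop = []
for s in sentences:
    with_loop.append(s.split())

print(with_comprehension == with_loop)
print(with_comprehension[0])
```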
 