**Introductory Python for ESU2018**

Welcome to the corpus text mining course! This first hands-on unit walks through some basic Python concepts and syntax so that everyone is on the same page for the rest of the course. If you are familiar with other programming languages, this should be easy for you. Keep in mind that this covers only the bare bones of what you need for introductory DH/text mining and this course; there are many more advanced resources for learning Python.

Here we will cover:
1. Commonly Used Data Structures
2. Strings and Common String Usages
3. Loops
4. Reading and Writing Files (CSV, plain text, etc.)
5. Regex 
6. Pandas




1. Commonly Used Data Structures  

Data structures are different ways of organizing your data in a program. We will go through a few of the most commonly used ones here.

Probably the most commonly used data structure is a list or vector.

In [None]:
#Python syntax uses square brackets for lists
my_list = [1,2,3,4]

In [None]:
my_list

In [None]:
#You can declare an empty list
empty_list = []

In [None]:
#You can append things to the end of a list. Lists keep their order and repeated elements.
my_list = [2,29.29494949,0,294,4,4,4,3,4,5,1]

In [None]:
my_list


In [None]:
my_list.append(300000000)

In [None]:
my_list

In [None]:
#And you can have different data types in a single list.
my_list.append('apples!')

In [None]:
my_list

In [None]:
#You can get the length or number of elements.
len(my_list)

In [None]:
#You can access elements by index. This gets the last element; my_list[-1] does the same
my_list[len(my_list)-1]

In [None]:
#Conversely, you can get the index of a given element
my_list.index('apples!')

In [None]:
numbered_list = [1,3095,39,1,304993,2,3,2,1,495820,3]

In [None]:
#You can sort your list too
sorted(numbered_list)

In [None]:
#Probably the most 'Pythonic' thing you can do with lists is slicing. You can get back a subset list by index.
sorted(numbered_list)[0:3]

In [None]:
sorted(numbered_list)[-3:len(numbered_list)]

In [None]:
sorted(numbered_list)[:len(numbered_list)]

In [None]:
sorted(numbered_list)[0:10:3]
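
The slices above all follow the pattern `some_list[start:stop:step]`: `start` is included, `stop` is excluded, and any of the three parts can be left out. A minimal sketch of the most common forms, using a made-up list called `letters` (not part of the course data):

```python
# Hypothetical example list, for illustration only
letters = ['a', 'b', 'c', 'd', 'e', 'f']

first_three = letters[0:3]     # indices 0, 1, 2 (the stop index 3 is excluded)
last_two = letters[-2:]        # negative indices count from the end
every_other = letters[::2]     # a step of 2, starting from index 0
reversed_copy = letters[::-1]  # a step of -1 walks the list backwards

print(first_three)
print(last_two)
print(every_other)
print(reversed_copy)
```

Slicing always returns a new list, so `letters` itself is left unchanged.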

In [None]:
#Another Pythonic thing is a list comprehension. We'll cover this later in the loops section

OK, next is the set. A set works the same way as in math: it is a group of unique elements with no ordering.

In [None]:
#In Python you declare an empty set like this ({} would create an empty dictionary instead).
my_set = set()


In [None]:
my_set.add(4)

In [None]:
my_set

In [None]:
my_set = {2,3,5,6}

In [None]:
my_set

In [None]:
4 in my_set

In [None]:
my_set.union({4})

In [None]:
3 in my_set.intersection({3,5})

In [None]:
my_set | {4}

In [None]:
#see documentation online for more!
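
Because a set keeps only unique elements, one common text mining use is turning a list of tokens into a vocabulary. The sketch below assumes a made-up token list; the names `tokens`, `vocabulary`, and `common` are illustrative only.

```python
# Hypothetical tokens, for illustration only
tokens = ['the', 'cat', 'saw', 'the', 'other', 'cat']

# Converting a list to a set drops the repeated elements
vocabulary = set(tokens)

# Sets are unordered, so sort them when you need a stable order
print(sorted(vocabulary))
print(len(tokens), len(vocabulary))

# The & operator is shorthand for intersection
common = vocabulary & {'cat', 'dog'}
print(common)
```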

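Dictionaries store key-value pairs. You can look up a value by its key. Keys are unique (like the elements of a set); since Python 3.7 a dictionary also remembers the order in which keys were inserted. A value can be anything, but a key must be a hashable (in practice, immutable) type such as a string, a number, or a tuple.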

In [None]:
my_dict = {'one':1, 'two':2, 'three':3}

In [None]:
my_dict['one']
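
Beyond looking up a value, the operations you will probably use most are adding or updating a pair, testing whether a key exists, and looping over the pairs. A minimal sketch, assuming a made-up dictionary of word counts (`word_counts` is a hypothetical name):

```python
# Hypothetical word counts, for illustration only
word_counts = {'cat': 2, 'dog': 1}

word_counts['fish'] = 5   # assigning to a new key adds a pair
word_counts['cat'] += 1   # assigning to an existing key updates its value

has_dog = 'dog' in word_counts        # membership tests check the keys
missing = word_counts.get('bird', 0)  # .get returns a default instead of raising KeyError

# .items() yields (key, value) pairs
for word, count in word_counts.items():
    print(word, count)
```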

Strings

In [None]:
#You can make strings in Python with quotes:
my_string = "my string"

In [None]:
len(my_string)

In [None]:
my_string.split()

In [None]:
my_string.split('s')

In [None]:
#Escaping special characters
my_string  = "\"my\" string"
my_string


In [None]:
my_string[0:4]

In [None]:
to_be_string = ["this","sentence","is","a","string"]

In [None]:
" ".join(to_be_string)

In [None]:
#stripping white spaces
my_string_with_extra_fat = "this is a string.     "

In [None]:
my_string_with_extra_fat.strip()

In [None]:
my_string_with_extra_fat.strip().upper()


In [None]:
upper_string = "UPPER STRING"

In [None]:
upper_string.lower()

In [None]:
my_string_with_extra_fat.capitalize()

Loops

In [None]:
#for loops are the most commonly used
for i in range(4):
    print(i)

In [None]:
#You can loop over any collection; a set has no guaranteed order
for item in {'apples','bananas','chocolate'}:
    print(item)

In [None]:
#while loops
condition = 4
while condition<100:
    print(condition)
    condition += 1

In [None]:
condition = 4
while True:
    if condition<100:
        print(condition)
        condition += 1
    else:
        break
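
The list comprehension mentioned in the data structures section is a compact way of writing a for loop that builds a list. The sketch below shows the same transformation both ways, using made-up example data; the variable names are illustrative only.

```python
# Hypothetical example data
words = ['Apples', 'bananas', 'Chocolate']

# An explicit for loop that builds a new list
lowered_loop = []
for word in words:
    lowered_loop.append(word.lower())

# The equivalent list comprehension
lowered = [word.lower() for word in words]

# A comprehension can also filter with a trailing `if`
long_words = [word for word in words if len(word) > 6]

print(lowered)
print(long_words)
```

Both forms give the same result; the comprehension is usually preferred when the loop body is a single expression.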

Reading and Writing Files

In [None]:
#Python's built-in open() function creates a file object for reading and writing files
f = open('./test_file', 'w')

In [None]:
f.write("testing!")

In [None]:
f.close()

In [None]:
with open('./another_test_file','w') as f:
    f.write("testing again!")

In [None]:
with open('./a_poem','r') as f:
    poem = f.read()
    

In [None]:
poem.split("\n")

In [None]:
f = open('./a_poem','r')

In [None]:
#Iterating over a file object yields one line at a time; close the file when done
for line in f:
    print(line.rstrip("\n"))
f.close()

Serializing Data Structures

In [None]:
import json
a = [1,2,3]

In [None]:
with open('./json_dump','w') as f:
    json.dump(a, f)

In [None]:
with open('./json_dump','r') as f:
    a_born_again = json.load(f)

In [None]:
a_born_again

In [None]:
b = {'a':333, 'b':33333}

In [None]:
json.dumps(b)

In [None]:
import pickle #binary serialization. NOT human readable. Also, does not work with non-Python languages

In [None]:
a = [1,2,3]
with open('./pickle_dump','wb') as f:
    pickle.dump(a, f)

In [None]:
pickle.dumps(a)
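
To get the object back, `pickle.loads` reverses `pickle.dumps` (and `pickle.load` reverses `pickle.dump` for files). A minimal round-trip sketch; the names `as_bytes` and `restored` are illustrative only.

```python
import pickle

a = [1, 2, 3]

# dumps returns a bytes object; loads rebuilds the Python object from it
as_bytes = pickle.dumps(a)
restored = pickle.loads(as_bytes)

print(type(as_bytes))
print(restored == a)   # same contents
print(restored is a)   # but a separate, newly built object
```

Only unpickle data you trust: loading a pickle from an unknown source can run arbitrary code.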

In [None]:
#writing and reading a csv

In [None]:
import csv


In [None]:
with open('our_example.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter='$')
    #Header row: ID, ANIMAL and WEIGHT are used later on; COLOR and NUMBER are placeholder names
    writer.writerow(['ID','ANIMAL','COLOR','NUMBER','WEIGHT'])
    writer.writerow(['333D','puppies', 'yellow', '30029412','3.1'])
    writer.writerow(['221A','unicorn','fuchsia','202293919','109.12'])
    writer.writerow(['1B','lobsters','aqua','591012219','0.19'])
    writer.writerow(['001I','dinosaurs','brown','29982358','0.911'])
    writer.writerow(['098G','donkeys','pale','02849222','9.55'])

In [None]:
with open('our_example.csv', 'r') as csvfile:
    reader = csv.reader(csvfile, delimiter='$')
    for row in reader:
        print('    '.join(row))

In [None]:
#What about by fieldnames?
with open('our_example.csv', 'r') as csvfile:
    reader = csv.DictReader(csvfile, delimiter="$")
    for row in reader:
        print(row['ID'])
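
`csv.DictReader` takes its field names from the first row it reads, so looking up `row['ID']` only works when the data begins with a header line. The sketch below illustrates this with made-up in-memory data (via `io.StringIO`) standing in for a file; the three column names are assumptions chosen for the example.

```python
import csv
import io

# Made-up data standing in for a file; the first line is the header
raw = "ID$ANIMAL$WEIGHT\n333D$puppies$3.1\n221A$unicorn$109.12\n"

reader = csv.DictReader(io.StringIO(raw), delimiter='$')
rows = list(reader)

# The header row becomes the keys of every row dictionary
print(reader.fieldnames)

ids = [row['ID'] for row in rows]
# The csv module returns every field as a string, so convert numbers explicitly
weights = [float(row['WEIGHT']) for row in rows]

print(ids)
print(weights)
```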

Regular Expressions

In [None]:
#We'll be doing more regex later in the course! This is just a taster
import re
m = re.search(r'cats', 'whose cats are they? they are cute cats')


In [None]:
m

In [None]:
m.group(0)

In [None]:
m = re.split(r'cats', 'whose cats are they? they are cute cats')



In [None]:
m
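
Two more functions from the `re` module that come up often in text mining are `re.findall`, which returns every match, and `re.sub`, which replaces matches. A short sketch on the same sentence; the third pattern is just one illustrative way to find words starting with "c".

```python
import re

text = 'whose cats are they? they are cute cats'

# findall returns a list of every non-overlapping match
all_matches = re.findall(r'cats', text)

# sub replaces every match with the given string
replaced = re.sub(r'cats', 'dogs', text)

# \b marks a word boundary and \w+ matches one or more word characters
words_with_c = re.findall(r'\bc\w+', text)

print(all_matches)
print(replaced)
print(words_with_c)
```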

Pandas

In [None]:
import pandas

In [None]:
from pandas import DataFrame, read_csv

In [None]:
df = pandas.read_csv('./our_example.csv', delimiter="$")

In [None]:
df

In [None]:
df.dtypes

In [None]:
df['ANIMAL']

In [None]:
df['WEIGHT'].max()

In [None]:
%matplotlib inline  
df['WEIGHT'].plot()

In [None]:
df['WEIGHT'].plot.bar()

In [None]:
df['WEIGHT'].mean()

In [None]:
df['WEIGHT'].std()