---

### 🎓 **Professor**: Apostolos Filippas

### 📘 **Class**: Web Analytics

### 📋 **Topic**: Loops, files, and working with data

🚫 **Note**: You are not allowed to share the contents of this notebook with anyone outside this class without written permission by the professor.

---

# 🔄 1. Iterating

A for loop in Python **iterates** over an object:
- It goes into the object and runs through all of its items, one at a time.
- That object can be a sequence, mutable or immutable. Objects that we've learnt about and can iterate over include strings, lists, and dictionaries (their keys and values).
- More generally, any object we can loop over is called an *iterable*. There are other iterables besides the ones above, and we'll learn about one such object today!
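
For the curious, here is a rough sketch of what "running through all of the items" means. It uses the built-in functions iter() and next(), which are not covered in this class; the list `letters` is just an illustration.

```python
letters = ['a', 'b', 'c']

# Ask the list for an iterator, then pull the items out one at a time
iterator = iter(letters)
first = next(iterator)
second = next(iterator)
third = next(iterator)
# One more next(iterator) would raise StopIteration,
# which is the signal a for loop uses to know it is done

# The for loop performs the same steps for us
collected = []
for letter in letters:
    collected.append(letter)

print(first, second, third)
print(collected)
```

In practice we almost never call iter() or next() ourselves; the for loop is the convenient way to do it.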

A general statement for a for loop is the following:

    for item_name in object:
        do something
        
Some observations are in order.
- The variable name (item_name) is up to you; any valid name works the same way.
- The variable can be referenced inside the loop, for example to print it or to check whether a condition holds.

Let's go see some examples!

## 1.1 Iterating over a list


In [None]:
my_list = [1,2,3,4,[1,2,3,4]]

for item in my_list:
    print(item)
    print('hi')

print('end of program')

In [None]:
my_list.reverse()
for item in my_list:
    print(item)

In [None]:
my_list = [1,2,3,4,5,6,7,8,9,10]

#we can also have conditions inside the loop
for item in my_list:
    if item>5:
        print(item)

In [None]:
#in another example, we only print even numbers
for item in my_list:
    if item%2 == 0:
        print(item)

In [None]:
#and we could have also had more statements such as else inside the loop
for item in my_list:
    if item%2==0:
        print(item)
    else:
        print('Number is not even')

In [None]:
#######################
#    IN CLASS EXERCISE
#######################

#  another common use for loops is summing over things
total = 0
for item in my_list:
    total += item
    print(total)

## 1.2 Iterating over a string

In [None]:
#iterating through a string, for iterates from the first to the last letter
#as one would expect!
for item in 'Learning Python':
    print(item)

## 1.3 Iterating over a dictionary

In [None]:
# when iterating over a dictionary, we iterate over its KEYS
my_dict = {'Dog':'Hund', 'Cat':'Katze', 'Mouse':'Maus'}
for k in my_dict:
    print(k)

In [None]:
my_dict = {'Dog':'Hund', 'Cat':'Katze', 'Mouse':'Maus'}
for k in my_dict:
    print(my_dict[k])

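Note from the above that the keys are printed in the order in which we inserted them. Since Python 3.7, dictionaries are guaranteed to preserve insertion order (in older versions the order could appear random).

Even so, as we said before, there is **no concept of index in dictionaries, as they are not sequences**: we look up values by key, not by position.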
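
Besides looping over the keys, we can loop over the values or over key-value pairs directly. A small sketch using the same dictionary and the built-in dictionary methods .values() and .items() (the list `pairs` is only there to collect the results):

```python
my_dict = {'Dog': 'Hund', 'Cat': 'Katze', 'Mouse': 'Maus'}

# .values() iterates over the values only
for german_word in my_dict.values():
    print(german_word)

# .items() gives us each key together with its value
pairs = []
for english_word, german_word in my_dict.items():
    print(english_word, '->', german_word)
    pairs.append((english_word, german_word))
```

Using .items() saves us from writing my_dict[k] inside the loop.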


## 1.4 Loops inside Loops

We can, of course, have a loop inside a loop. Here's an example:

In [None]:
list1 = [1,3,5]
list2 = [2,4,6]

for x in list1:
    for y in list2:
        print(str(x*y))

In [None]:
list1 = [1,3,5]
list2 = [2,4,6]

for x in list1:
    if x >= 3:
        for y in list2:
            print(str(x*y))

## 1.5 Break, Continue

The break and continue statements give us further control over loops.
- **break**: when the break statement is executed, we exit the current (innermost) loop immediately.
- **continue**: when the continue statement is executed, we skip the rest of the code in the current iteration and move on to the next iteration of the current loop.

Let's see examples.

In [None]:
my_list = [0,1,2,3,4,5,6,7,8,9,10]
for k in my_list:
    #when the number is even, we just go to the next iteration
    # hence k will not be printed
    if k%2==0:
        continue
    print(k)

In [None]:
my_list = [0,1,2,3,4,5,6,7,8,9,10]
for k in my_list:
    #when k==6 we escape the loop
    if k==6:
        break
    print(k)
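
One detail worth seeing in action: inside nested loops, break only leaves the loop it sits in. A minimal sketch, reusing the two lists from section 1.4 (the list `results` is just for collecting the output):

```python
list1 = [1, 3, 5]
list2 = [2, 4, 6]

results = []
for x in list1:
    for y in list2:
        if y == 4:
            break              # leaves only the inner loop (over list2)
        results.append(x * y)
    # execution resumes here, and the outer loop moves on to the next x

print(results)
```

The outer loop still visits every x; only the inner loop is cut short each time.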

## 1.6 range

The range function is one of the most useful built-ins, and shines in the context of loops.

In Python 3, range produces its numbers lazily, one at a time, instead of building a full list up front.
- range(x,y) generates all integers from x to y-1

In [None]:
range(0,10)

Notice that the output is just a range object, not a list of numbers. Let's see its use in the context of for loops.

In [None]:
for k in range(0,10):
    print(k)

In [None]:
my_list = [0,100,12,3,14,5]
for k in range(0,len(my_list)):
    print(my_list[k])
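
To see which numbers a range stands for without writing a loop, we can wrap it in list(). The sketch below also shows two variations of range: leaving out the start, and adding a third argument for the step.

```python
# list() forces the range to produce all of its numbers at once
print(list(range(0, 10)))      # the integers 0 through 9
print(list(range(10)))         # same thing: the start defaults to 0
print(list(range(0, 10, 2)))   # a third argument sets the step

evens = list(range(0, 10, 2))
countdown = list(range(10, 0, -1))   # a negative step counts down
print(countdown)
```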

## 1.7 "While" loops

While loops are pretty similar to for loops, but they keep iterating for as long as a condition holds (that is, until the condition becomes false).

For example

In [None]:
count = 0
while count < 5:
    print(count)
    count += 1  # This is the same as count = count + 1

Pretty easy, huh? :) 
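
While loops are most useful when we do not know in advance how many iterations we need. As an illustration (the numbers here are made up), let's count how many times we must double 1 before it exceeds 1000:

```python
value = 1
steps = 0
while value <= 1000:
    value = value * 2   # double the value
    steps += 1          # and count the doubling

print(steps, value)
```

Be careful: if the condition never becomes false, the loop runs forever. Here the loop ends because value keeps growing.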













---
# 📄 2. Files

The computer I was using when writing this has more than 271,925 files stored on its disk. 

Installing Python 3 added more than 3,000 files to that total. Your computer is, metaphorically speaking, drowning in files. Files are the primary storage mechanism of every operating system, and are a paradigm so well-established and ingrained in our thinking that we would have a hard time imagining any alternative.

Files come in many formats: audio, video, programs with obscure formats, Excel spreadsheets, PowerPoint presentations, xsl, etc. You will probably need to install certain modules/libraries to interact with most of these file types. Our focus for today will be the simple but powerful text files. So let's get cracking.

## 2.1 More Unix

Before we open our first file, let's look at some other useful commands.

The *mkdir* command will create a new directory.

In [None]:
%mkdir new_directory

We can check whether the directory was created with *ls*

In [None]:
ls

## 2.2 Opening and Reading a file

Let's now start working with files.

- First, go to http://bit.ly/aVerySimpleFile and download 'simple_file.txt'
- After you download it, put it in a folder named 'files' inside your working directory (the code below opens './files/simple_file.txt')


In [None]:
#open the file
my_file = open('./files/simple_file.txt', mode='r', encoding='utf-8')

A couple of notes are in order:
- the first argument of the open function is the file's path (ending in the file's name). So far so good.
- the second argument tells python why we are opening the file. We open it to read its contents, therefore we are using the 'r' mode.
- to read a file we have to know how its contents were encoded. Most of the time the encoding is 'utf-8', and the third argument contains this information

In [None]:
#we can now read the contents of the file
#the contents of a file are just a simple string!
my_file.read()

In [None]:
#what if we try to read the file again?
my_file.read()

When we tried to re-read the file, we got back an empty string. This is because the imaginary "cursor" pointed at the end of the file after reading it, so there was nothing left to be read. If we want to read the file again, we have to put the cursor back to position 0.

In [None]:
#Seek to the start of file (index 0)
my_file.seek(0)

In [None]:
my_file.read()

In [None]:
#If you put the cursor at position 5, you ignore the first five characters!
my_file.seek(5)

In [None]:
my_file.read()
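
Here is the whole cursor story in one self-contained sketch. It assumes nothing about simple_file.txt: it first creates a small throwaway file (the name cursor_demo.txt and its contents are made up for this illustration) and then repeats the read and seek steps from above.

```python
import os
import tempfile

# Create a small throwaway file so the example runs anywhere
tmp = tempfile.TemporaryDirectory()
path = os.path.join(tmp.name, 'cursor_demo.txt')
my_file = open(path, mode='w', encoding='utf-8')
my_file.write('Hello world')
my_file.close()

my_file = open(path, mode='r', encoding='utf-8')
first_read = my_file.read()    # reads everything; the cursor is now at the end
second_read = my_file.read()   # nothing left to read, so we get an empty string
my_file.seek(0)                # move the cursor back to the start
third_read = my_file.read()    # the whole file again
my_file.seek(5)                # put the cursor after the first five characters
fourth_read = my_file.read()   # only the rest of the file
my_file.close()
tmp.cleanup()                  # delete the throwaway folder

print(repr(first_read), repr(second_read), repr(third_read), repr(fourth_read))
```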

In [None]:
#to close a file you've opened, simply:
my_file.close()

In [None]:
# let's go from the beginning, and assign the file string to a variable
my_file = open('files/simple_file.txt', mode='r', encoding='utf-8')
contents = my_file.read()
my_file.close()
contents

In [None]:
#if we want to have each line of the file in a separate location, 
#we can just use the split('\n') method to split in every newline character
lines = contents.split('\n')
lines

In [None]:
#Python also has a method for that called readlines()
#(unlike split('\n'), it keeps the '\n' at the end of each line)
my_file = open('files/simple_file.txt', mode='r', encoding='utf-8')
lines = my_file.readlines()
my_file.close()

lines
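
The two approaches give slightly different lists, so it is worth comparing them side by side. The sketch below uses io.StringIO, a standard-library object that behaves like an opened text file, so that it runs without any file on disk; the three lines of text are made up.

```python
import io

contents = 'first line\nsecond line\nthird line\n'   # made-up file contents
fake_file = io.StringIO(contents)                    # behaves like an opened text file

# split('\n') removes the newlines, but leaves an empty string
# at the end when the text finishes with a newline
lines_from_split = contents.split('\n')

# readlines() keeps the '\n' at the end of every line
lines_from_readlines = fake_file.readlines()

# rstrip('\n') removes the trailing newline from each line
clean_lines = []
for line in lines_from_readlines:
    clean_lines.append(line.rstrip('\n'))

print(lines_from_split)
print(lines_from_readlines)
print(clean_lines)
```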

Even though you all will most likely be using *.CSV* files for most of your assignments in the rest of the MSBA program, it is important to learn how to read and use .txt files to focus on string manipulation. 

Having a strong coding foundation is fundamental before moving into more complex functions and libraries. 

## 2.3 Writing to a file

We now want to add a line to the file. Let's say we want to add a line saying 'files are interesting'. 

We will see how we are going to do that.

In [None]:
#open the file, read it, then close it
my_file = open('files/simple_file.txt', mode='r', encoding='utf-8')
file_contents = my_file.read()
my_file.close()

# append your new content to the end of the string
# since we want our new content to be a new line, 
# our content should start with the new line character \n
file_contents = file_contents+'\nfiles are interesting'
file_contents

In [None]:
# we now open a new file with mode 'w', to write on it.
# opening a file that does not exist   with mode = 'w' will CREATE a new file
# opening a file that exists           with mode = 'w' will OVERWRITE the old file
#writing on a file erases its previous content - but luckily for us we stored it in the variable 'file_contents'
my_file = open('files/new_simple_file.txt', mode='w', encoding='utf-8')
my_file.write(file_contents)
my_file.close()

In [None]:
#let's see what the file now contains
my_file = open('files/new_simple_file.txt', mode='r', encoding='utf-8')
file_contents = my_file.read()
my_file.close()
file_contents

In [None]:
#alternatively, using the readlines method
my_file = open('files/new_simple_file.txt', mode='r', encoding='utf-8')
lines = my_file.readlines()
my_file.close()
lines
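
As an alternative to reading everything and rewriting it, open() also supports the append mode 'a', which adds new text at the end of an existing file without erasing it. A minimal sketch on a throwaway file (the file name append_demo.txt and its first line are made up for this illustration):

```python
import os
import tempfile

tmp = tempfile.TemporaryDirectory()
path = os.path.join(tmp.name, 'append_demo.txt')

# mode 'w' creates the file and writes the first line
my_file = open(path, mode='w', encoding='utf-8')
my_file.write('files are simple')
my_file.close()

# mode 'a' keeps the existing content and writes after it
my_file = open(path, mode='a', encoding='utf-8')
my_file.write('\nfiles are interesting')
my_file.close()

my_file = open(path, mode='r', encoding='utf-8')
file_contents = my_file.read()
my_file.close()
tmp.cleanup()   # delete the throwaway folder

print(file_contents)
```

The approach in the cells above is still useful when you want to keep the original file untouched and write the result somewhere else.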

## 2.4 Reading data one line at a time

Python offers a convenient way to read files one line at a time.

This is done through the **for** loop that we covered in Section 1. Let's see an example:

In [None]:
for line in open('files/simple_file.txt', mode='r', encoding='utf-8'):
    print(line)

Let's break down what we did above. 

We said that for every line in this text file, go ahead and print that line. It's important to note a few things:

1. We could have called the 'line' variable anything (see example below).
2. By not calling .read() on the file, the whole text file was not stored in memory.

Now let's see an example of (1)
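
For point (1), here is a minimal sketch. So that it runs without simple_file.txt, it uses io.StringIO from the standard library as a stand-in for an opened text file; the two lines of text are made up.

```python
import io

# io.StringIO stands in for an opened text file here
fake_file = io.StringIO('first line\nsecond line\n')

seen = []
for banana in fake_file:   # 'banana' works just as well as 'line'
    print(banana)
    seen.append(banana)
```

The name makes no difference to Python, but a descriptive name such as line makes the code much easier to read.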

## 2.5 Closing Files Automatically

A very useful way to open files is by using the context manager. This is done by using the **with** statement.

In [None]:
with open('files/simple_file.txt', mode='r', encoding='utf-8') as file:
    for line in file:
        print(line)

The with statement creates a code block and the for statement creates another code block. That is why print is indented twice (it belongs to the code block inside a code block).

- The with statement makes sure that our file will be closed without actually calling the close() command.
- The file will be closed even if the program crashes while we read the file.
- It is (extremely) good practice to read our files using the with statement!

Context managers ensure that resources are efficiently and properly managed. The most common use case is with file handling, but context managers can be applied in numerous scenarios to handle resources.
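
We can check the claim that the file gets closed even when something goes wrong. The sketch below is only an illustration: it works on a made-up throwaway file, raises an error on purpose inside the with block, and uses try/except (which we have not covered) only to keep the program from stopping. Every file object has a closed attribute that tells us whether it is closed.

```python
import os
import tempfile

tmp = tempfile.TemporaryDirectory()
path = os.path.join(tmp.name, 'with_demo.txt')
with open(path, mode='w', encoding='utf-8') as file:
    file.write('line one\nline two')

try:
    with open(path, mode='r', encoding='utf-8') as file:
        first_line = file.readline()
        # simulate a crash in the middle of reading
        raise ValueError('something went wrong while reading')
except ValueError:
    pass

closed_after_error = file.closed   # was the file closed despite the error?
print(closed_after_error)
tmp.cleanup()   # delete the throwaway folder
```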

(advanced) For those of you who are curious,
- In the code above, the open() function returns a file object, and it's this object that acts as the context manager. When the block of code under with completes, the file's `__exit__` method is called, automatically closing the file, even if an exception was raised.
- You can create your own context managers, but we'd have to get into python classes to understand how!



---
# 🗂️ 3. Data Analysis - Amazon reviews
Now that we have gone over our Python basics, we are fully equipped to perform our first dataset analysis.

Our dataset is contained in the file *amazon_reviews.txt* (you can download it through http://bit.ly/someAmazonReviews).

Each line contains tab-separated ('\t' character used as separator) information about a review.

In each review (and therefore in each line), we have the following information
1. ID of the product reviewed
2. ID of the reviewer
3. rating
4. space separated helpful/not helpful votes
5. summary of the review
6. review text

## 3.1 Opening and making data workable

Let's begin by opening the file and putting its contents in a list.

In [None]:
with open('files/amazon_reviews.txt', mode='r', encoding='utf-8') as f:
    reviews = f.read()

In [None]:
#reviews is now a string containing the whole file.
#to have every review separately, we will use a list and split the contents at every new line
reviews = reviews.split('\n')
#alright, let's check how many reviews we have
len(reviews)

In [None]:
#also, let's check whether everything appears to be right grabbing a couple of the reviews
reviews[500]

Everything seems to be fine, but every entry is still tab separated. 

Let's fix that (and make it easier to work with) by splitting every entry at the tab ('\t') character, and reassigning the resulting list.

In [None]:
for i in range(0,len(reviews)):
    reviews[i] = reviews[i].split('\t')

In [None]:
reviews[500]

In [None]:
reviews[150]
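
To make the structure of one entry concrete, here is a sketch on a single made-up line. The ids and text below are invented and do not come from the data set; only the six-field layout follows the description above.

```python
# A made-up line in the same six-field layout
line = 'B000TEST01\tA1REVIEWER\t4.0\t3 1\tGreat game\tI enjoyed playing this a lot.'

fields = line.split('\t')

# Unpacking gives every field a readable name
product_id, reviewer_id, rating, votes, summary, text = fields

helpful, not_helpful = votes.split(' ')   # the votes field is space separated
rating_number = float(rating)             # every field starts out as a string

print(product_id, reviewer_id, rating_number, helpful, not_helpful)
```

Note that after splitting, everything is still a string, so we need float() or int() before doing arithmetic with ratings or votes.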

## 3.2 How many different Products in our data set?

Since every product has a unique id (entry 1), we just have to find the number of distinct entries. We will use *sets* and a simple for loop to find them.

In [None]:
##################
#  In - CLASS
##################

products = set()
for i in range(0,len(reviews)):
    products.add(reviews[i][0])
print(len(products))
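
The same count can be written by looping over the reviews directly instead of over their indices. A small sketch on made-up data (the names toy_reviews and toy_products, and the ids in them, are invented so that the real variables are not overwritten):

```python
# Three made-up reviews: product id, reviewer id, rating
toy_reviews = [
    ['P1', 'R1', '5.0'],
    ['P2', 'R1', '4.0'],
    ['P1', 'R2', '3.0'],
]

toy_products = set()
for review in toy_reviews:
    toy_products.add(review[0])   # adding 'P1' twice keeps only one copy

print(len(toy_products))
```

A set silently ignores duplicates, which is exactly why it is a good fit for counting distinct values.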

## 3.3 How many different reviewers in our data set?

We can use the very same idea, but instead gather the distinct reviewers (entry 2). 

We will find out that there are around three times as many reviewers as video games.

In [None]:
##################
#  In - CLASS
##################

reviewers = set()
for i in range(0,len(reviews)):
    reviewers.add(reviews[i][1])
print(len(reviewers))

## 3.4 Who are the top 50 most prolific reviewers?

To answer that question, we will have to find how many reviews each reviewer has written. 

Dictionaries are perfectly suited for this endeavor, since we can have the keys be the reviewer ids and the values be the number of reviews.

In [None]:
##################
#  In - CLASS
#       hint: use dict
##################

num_reviews = dict()
for i in range(0,len(reviews)):
    if reviews[i][1] in num_reviews:
        num_reviews[reviews[i][1]]+=1
    else:
        num_reviews[reviews[i][1]]=1
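
The if/else counting pattern is so common that there is a shorter way to write it with the dictionary method .get(key, default), which returns the default when the key is missing. This is an alternative sketch on made-up data, not a replacement for the solution above (toy_reviews and toy_counts are invented names):

```python
toy_reviews = [
    ['P1', 'R1', '5.0'],
    ['P2', 'R1', '4.0'],
    ['P1', 'R2', '3.0'],
]

toy_counts = {}
for review in toy_reviews:
    reviewer = review[1]
    # .get returns the current count, or 0 if we have not seen this reviewer yet
    toy_counts[reviewer] = toy_counts.get(reviewer, 0) + 1

print(toy_counts)
```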

Alright, we now have each reviewer's id and the number of reviews they have written, but we want to find out the top 50. We can think of many ways to do it, and we will follow a specific one here. Let's put everything into a list of lists!

In [None]:
sorted_reviews = []
for k in num_reviews:
    sorted_reviews.append([num_reviews[k],k])
sorted_reviews[:10]

In [None]:
sorted_reviews.sort(key=lambda x: x[0])
sorted_reviews.reverse()
sorted_reviews[:50]
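
One of the other ways to get the top reviewers is the built-in sorted() function, which returns a new sorted list and accepts reverse=True for descending order. A sketch on made-up counts (toy_counts and the reviewer ids are invented for this illustration):

```python
toy_counts = {'R1': 2, 'R2': 5, 'R3': 1, 'R4': 3}

# .items() yields (reviewer, count) pairs; we sort them by the count
top_two = sorted(toy_counts.items(), key=lambda pair: pair[1], reverse=True)[:2]

print(top_two)
```

With the real data we would slice with [:50] instead of [:2].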


## 3.5 Finding the ratings distribution

We may also find the ratings distribution - how many instances of each rating are present in our data set.

The idea here is similar to what we did before, and we will use a dictionary.

In [None]:
##################
#  In - CLASS
##################

ratings = dict()
for i in range(0,len(reviews)):
    if reviews[i][2] in ratings:
        ratings[reviews[i][2]] += 1
    else:
        ratings[reviews[i][2]] = 1

In [None]:
ratings

In [None]:
# if you want to see how it works step by step, for the first 5 reviews:

ratings = {}

for review in reviews[:5]:
    print(f"Current ratings: {ratings}")
    print(f"Current review: {review[2]}")
    if review[2] not in ratings:
        ratings[review[2]] = 1
    else:
        ratings[review[2]] += 1
    print(f"New ratings: {ratings}")
    print("-------------------------")

## 3.6 Plotting the distribution
We can also use python to make plots! Here's a very plain one. 

If you want to make it fancier you can google how matplotlib works in python!

In [None]:
import matplotlib.pyplot as plt

# the next line tells jupyter to display the picture inline---in the notebook
%matplotlib inline  

x = [1,2,3,4,5]
y = [2599,2894, 5918, 12533, 26056]
plt.bar(x, y)
plt.xlabel("Rating")
plt.ylabel("Number of Reviews")
plt.title("Distribution of Ratings")

# uncomment if you want to save your figure!
#plt.savefig("files/a_figure.png",dpi=72)
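
Instead of typing the counts by hand, the two lists for the bar chart can be built from a ratings dictionary like the one in section 3.5. The sketch below uses a small made-up dictionary (toy_ratings, with invented counts) and assumes the keys are strings such as '5.0', as in our data:

```python
# Made-up counts in the same shape as the ratings dictionary
toy_ratings = {'5.0': 3, '1.0': 1, '4.0': 2}

toy_x = []
toy_y = []
for rating in sorted(toy_ratings):       # sort the keys so the bars appear in order
    toy_x.append(float(rating))          # '4.0' -> 4.0
    toy_y.append(toy_ratings[rating])    # the number of reviews with that rating

print(toy_x)
print(toy_y)
```

Lists built this way can be passed to plt.bar in place of the hand-typed x and y.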

## Quiz: Correlation between review length and rating

In [None]:
# get all review lengths
review_lengths = [len(review[5]) for review in reviews]
# create a dictionary with the ratings as keys and, as values, lists with the lengths of the reviews that have that rating
reviews_by_rating = {'1.0':[], '2.0':[], '3.0':[], '4.0':[], '5.0':[]}

for i in range(len(reviews)):
    rating = reviews[i][2]
    reviews_by_rating[rating].append(len(reviews[i][5]))

avg_lengths = {}
for rating, lengths in reviews_by_rating.items():
    avg_lengths[rating] = sum(lengths) / len(lengths)



In [None]:
avg_lengths
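
The averages give a first impression; a single number that summarizes the relationship is the Pearson correlation coefficient, which ranges from -1 to 1. The sketch below computes it by hand with the tools from this lecture. The two lists are made up (they are not results from our data set), and all variable names are invented for this illustration.

```python
# Made-up data: one rating and one review length per review
toy_ratings = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_lengths = [900, 750, 600, 450, 300]

n = len(toy_ratings)
mean_rating = sum(toy_ratings) / n
mean_length = sum(toy_lengths) / n

covariance = 0
var_rating = 0
var_length = 0
for i in range(n):
    d_rating = toy_ratings[i] - mean_rating   # distance from the mean rating
    d_length = toy_lengths[i] - mean_length   # distance from the mean length
    covariance += d_rating * d_length
    var_rating += d_rating ** 2
    var_length += d_length ** 2

# Pearson correlation: covariance divided by the product of the two spreads
correlation = covariance / (var_rating ** 0.5 * var_length ** 0.5)
print(correlation)
```

In this toy example the lengths fall in a perfectly straight line as the rating goes up, so the correlation is (up to rounding) -1. With the real data you would use float(review[2]) and len(review[5]) for every review.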

In [None]:
import matplotlib.pyplot as plt

# named rating_labels so that we do not overwrite the ratings dictionary from 3.5
rating_labels = list(avg_lengths.keys())
avg_len_values = list(avg_lengths.values())

plt.bar(rating_labels, avg_len_values, color='blue', alpha=0.7)
plt.xlabel("Rating")
plt.ylabel("Average Review Length")
plt.title("Average Review Length by Rating")
plt.show()
