# Practical 3 - Python Basics II: Strings and Functions

The aim of this practical is to familiarise you with some more advanced concepts of the Python programming language. The problems cover functions, file input/output, string manipulation, lists, tuples and dictionaries.

There are 2 sections, each containing 4 problems. You are encouraged to attempt as many as possible, but you are only required to demonstrate **one problem from each section** for marking. You need to explain to your tutor how you have solved the problem and answer any questions they may ask.

## Contents
1. Section A
    - [Problem A1](#Problem-A1:-Calculating-List-Properties): Calculating List Properties
    - [Problem A2](#Problem-A2:-Calculating-Vector-Products): Calculating Vector Products
    - [Problem A3](#Problem-A3:-Parsing-and-Counting-Votes): Parsing and Counting Votes
    - [Problem A4](#Problem-A4:-Selective-Capitalisation): Selective Capitalisation
2. Section B
    - [Problem B1](#Problem-B1:-File-Input-and-Output): File Input and Output
    - [Problem B2](#Problem-B2:-Complex-Dictionaries): Complex Dictionaries
    - [Problem B3](#Problem-B3:-Searching-Files): Searching Files
    - [Problem B4](#Problem-B4:-File-Word-Count): File Word Count

## Section A
### Problem A1: Calculating List Properties

Write a program to calculate the sum, maximum and minimum values of a list. You need to define your own functions for calculating the above values instead of using the built-in Python functions. A skeleton of your program is provided below and you need to complete the empty functions.

In [5]:
# function definitions
def mysum(values_list):
    # your code goes here to calculate and return the sum of values in list values_list
    total = 0  # named total so the built-in sum function is not shadowed
    for x in values_list:
        total += x
    return total

def mymax(values_list):
    # your code goes here to calculate and return the maximum value in list values_list
    # start from the first element so that lists of only negative numbers work too
    a = values_list[0]
    for i in values_list:
        if i > a:
            a = i
    return a

def mymin(values_list):
    # your code goes here to calculate and return the minimum value in list values_list
    a = values_list[0]
    for i in values_list:
        if i < a:
            a = i
    return a
    


The input list, `values_list`,  is defined in the cell below for you, along with a print function to print the return values of the functions you have defined above.

Some hints for the function definitions:

- `mysum`:
    - loop over the elements of `values_list` and add each value to a variable that is recording the total.
    - return the total.
- `mymax`:
    - loop over the elements of `values_list`.
    - use a variable to record the largest number the loop has encountered.
    - before looping over `values_list`, you may wish to set the variable to be the first element of `values_list`, i.e. `values_list[0]`.
    - return the variable once the loop has finished.
- `mymin`:
    - use the same strategy as `mymax` except record the lowest value instead of the largest.
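
The hint about starting from `values_list[0]` matters more than it may seem. The sketch below is only an illustration (the function name `largest_value` is made up for this example): if the running maximum started at `0`, a list containing only negative numbers would wrongly report `0`.

```python
def largest_value(values_list):
    # Start from the first element rather than 0, so that lists containing
    # only negative numbers are handled correctly.
    largest = values_list[0]
    for value in values_list:
        if value > largest:
            largest = value
    return largest

print(largest_value([-7, -2, -9]))  # prints -2
```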

You do not need to complete any code in the cells below. You only need to run them and verify they produce the correct values. Remember that if you changed the code in the cells above containing the function definitions, you will need to re-run the definition cell before running the cells below.

The expected output for the cell below is:
```
Sum: 45 Max: 9 Min: 0
```

In [7]:
x = range(10)  # x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print("Sum:", mysum(x), "Max:", mymax(x), "Min:", mymin(x))

Sum: 45 Max: 9 Min: 0


The expected output of the cell below is:
```
Sum: 25 Max: 9 Min: 1
```

In [9]:
x = range(1,10,2)  # x = [1, 3, 5, 7, 9]
print("Sum:", mysum(x), "Max:", mymax(x), "Min:", mymin(x))

Sum: 25 Max: 9 Min: 1


The expected output of the cell below is:
```
Sum: 1023.0 Max: 512.0 Min: 1.0
```

In [11]:
x = [2.0**i for i in range(10)]  # x = [1.0, 2.0, 4.0, 8.0, ..., 512.0]
print("Sum:", mysum(x), "Max:", mymax(x), "Min:", mymin(x))

Sum: 1023.0 Max: 512.0 Min: 1.0


### Problem A2: Calculating Vector Products

Write a program to calculate the normal of a vector (more commonly called its norm, or length), as well as the inner product and distance between two vectors. A vector is a mathematical object that can be represented as a list of numbers.

Let $x=[x_1, x_2, ..., x_n]$, $y = [y_1, y_2, ..., y_n]$ be two vectors of size $n$. The **normal** of vector $x$ is given by:

$$normal(x) = \sqrt{\sum_{i=1}^{n} x^{2}_{i}} = \sqrt{x_{1}^{2} + x_{2}^{2} + \ldots + x_{n}^{2}}$$

The **inner product** between vectors $x$ and $y$ is given by:

$$innerprod(x, y) = \sum_{i=1}^{n} x_i y_i = x_1 y_1 + x_2 y_2 + \ldots + x_n y_n$$

The **distance** between vectors $x$ and $y$ is given by:

$$dist(x, y) = \sqrt{normal(x)^2 + normal(y)^2 - 2 \times innerprod(x, y)}$$

A skeleton of your program is provided below and you need to complete the empty functions. (Hint: the square root can be calculated using the `sqrt` function from the `math` library.)
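
The distance formula above is an expansion of the usual Euclidean distance, $\sqrt{\sum_{i}(x_i - y_i)^2}$. As a quick sanity check, the sketch below compares the two forms on a pair of made-up vectors; the values of `x` and `y` are assumptions chosen only for illustration.

```python
from math import sqrt, isclose

x = [1, 2, 3]  # made-up example vectors
y = [4, 6, 8]

# Direct Euclidean distance: square root of the summed squared differences.
direct = sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# The same distance via the expansion used in this problem.
norm_x_squared = sum(a * a for a in x)
norm_y_squared = sum(b * b for b in y)
inner = sum(a * b for a, b in zip(x, y))
expanded = sqrt(norm_x_squared + norm_y_squared - 2 * inner)

print(isclose(direct, expanded))  # prints True
```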

In [13]:
from math import sqrt  # importing the sqrt function for use

def normal(vector):
    # your code goes here to calculate and return the normal of the vector
    sum1 = 0
    for x in vector:
        sum1 = x**2 + sum1
    return sqrt(sum1)


def innerproduct(vector_x, vector_y):
    # your code goes here to calculate and return the inner product of vector_x and vector_y
    # you may assume that vector_x and vector_y have the same number of values
    sum1 = 0
    len1 = len(vector_x)
    for i in range(len1):
        sum1 += vector_x[i] * vector_y[i]
    return sum1


def distance(vector_x, vector_y):
    # your code goes here to calculate and return the distance between vectors vector_x and vector_y
    # you may assume that vector_x and vector_y have the same number of values
    squared_distance = normal(vector_x)**2 + normal(vector_y)**2 - 2 * innerproduct(vector_x, vector_y)
    return sqrt(squared_distance)  # the square root of the squared distance is the Euclidean distance


The input vectors, `vector_x` and `vector_y`, are defined as lists in the cell below for you, along with a print function to print the return values of the functions you have defined above. The expected output for this cell is:
```
normal(vector_x): 5.4772 normal(vector_y): 5.4772
```

In [15]:
vector_x = range(5)
vector_y = range(4, -1, -1)
print("normal(vector_x):", normal(vector_x), "normal(vector_y):", normal(vector_y))

normal(vector_x): 5.477225575051661 normal(vector_y): 5.477225575051661


The expected output for the cell below is:
```
innerproduct(vector_x, vector_y): 10
```

In [17]:
print("innerproduct(vector_x, vector_y):", innerproduct(vector_x, vector_y))

innerproduct(vector_x, vector_y): 10


The expected output for the cell below is:
```
distance(vector_x, vector_y): 6.3246
```

In [19]:
print("distance(vector_x, vector_y):", distance(vector_x, vector_y))

distance(vector_x, vector_y): 6.324555320336759


### Problem A3: Parsing and Counting Votes

Write a function that counts the votes in an input string. The votes are represented by 'Y's and 'N's separated by commas ','. Your function should handle both uppercase and lowercase 'Y's and 'N's. Moreover, there might be extra spaces before and/or after each vote. Your function should display a message for the voting results which must contain the result of the vote (either accepted or rejected), the number of yes votes, the number of no votes, and the total number of votes. The vote is accepted only if the number of yes votes exceeds the number of no votes. The skeleton program is provided below and you need to complete the function `count_votes`.

In [35]:
def count_votes(votes):
    votes = votes.upper().split(',')
    cleaned_votes = [vote.strip() for vote in votes]
    yes = cleaned_votes.count("Y")
    no = cleaned_votes.count("N")
    total = yes + no
    if yes > no:
        result="Vote accepted"
    else:
        result="Vote rejected"
    print(f"{result}: {total} votes in total, {yes} accepted and {no} rejected.")

The input votes are defined below for you as the string `votes`. Notice how there are both uppercase and lowercase votes, as well as a variable number of spaces between the votes that your `count_votes` function will need to handle correctly. A call to the `count_votes` function is also provided.

_Hint:_ one way to complete the `count_votes` function is to use the `str.split(sep)` method, passing in `','` as the _sep_ parameter. Then you may loop over the elements of the result and determine if each element contains a 'Y' or 'y' to count it as a yes vote, or 'N' or 'n' for no votes.
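
To make the hint concrete, here is a small sketch of what `str.split(',')` returns and how `str.strip()` and `str.upper()` tidy each piece. The string `sample` is a made-up input, not the one used for marking.

```python
sample = " y, N ,n,Y "  # made-up input for illustration

pieces = sample.split(",")
print(pieces)  # prints [' y', ' N ', 'n', 'Y ']

# Remove the surrounding spaces and normalise the case of every piece.
cleaned = [piece.strip().upper() for piece in pieces]
print(cleaned)  # prints ['Y', 'N', 'N', 'Y']
```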

The cell below should produce something similar to the following output:
```
Vote rejected: 20 votes in total, 7 accepted and 13 rejected.
```
Depending on how you wrote your solution, your output may differ slightly; however, the numerical results and the vote result (either accepted or rejected) should be the same.

In [37]:
votes = "N , Y, Y,N,n , N , N , N ,n ,y, n,N,Y, y,Y,N , N , n ,y, N"
count_votes(votes)

Vote rejected: 20 votes in total, 7 accepted and 13 rejected.


The expected output for the cell below should be similar to:
```
Vote accepted: 20 votes in total, 11 accepted and 9 rejected.
```

In [39]:
votes = " y,n,Y ,y,y ,n, N, Y,N , N,y , n, Y,Y,y, N ,y, n, n , Y "
count_votes(votes)

Vote accepted: 20 votes in total, 11 accepted and 9 rejected.


### Problem A4: Selective Capitalisation

Write a program to capitalise each word in a phrase unless the word is an escape word such as a, an, the, am, is, are, and, of, in, on, with, from, to. For example, given the input message "I am a good player", your program should return "I am a Good Player". A skeleton program is provided in the following cell and you need to complete the empty function for word capitalisation.

_Hint: you can create a list of the escape words given above and decide if a new word is an escape word by searching the list_
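
The sketch below illustrates the two building blocks the hint relies on: the `in` operator for searching a list, and `str.capitalize()`, which upper-cases only the first letter of a word (unlike `str.upper()`, which upper-cases every letter). The shortened list `escape_words` is an assumption made for this example.

```python
escape_words = ["a", "an", "the"]  # shortened list, for illustration only

print("the" in escape_words)     # prints True
print("player" in escape_words)  # prints False

# capitalize() changes only the first letter to upper case;
# upper() changes every letter.
print("player".capitalize())  # prints Player
print("player".upper())       # prints PLAYER
```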

In [42]:
def capitalise(phrase):
    # words that are left unchanged
    escaping = ["a", "an", "the", "am", "is", "are", "and", "of", "in", "on", "with", "from", "to"]
    a = phrase.split()
    for i in range(len(a)):
        if a[i] not in escaping:
            # capitalize() upper-cases the first letter only; upper() would upper-case the whole word
            a[i] = a[i].capitalize()
    return " ".join(a)

Some calls to the `capitalise` function you have defined above are provided for you below. The expected output of the cell below is:
```
I am an Educator and a Researcher
Big Data is the Future of Information Technology
He Wants to Have Breakfast with Her in the Hotel
```

In [45]:
print(capitalise("I am an educator and a researcher"))
print(capitalise("big data is the future of information technology"))
print(capitalise("He wants to have breakfast with her in the hotel"))

I am an Educator and a Researcher
Big Data is the Future of Information Technology
He Wants to Have Breakfast with Her in the Hotel

In [47]:
capitalise("big data is the future of information technology")

'Big Data is the Future of Information Technology'

In [49]:
capitalise("I am an educator and a researcher")

'I am an Educator and a Researcher'

## Section B
### Problem B1: File Input and Output

In this problem you will be given a text file (scemunits.txt) that stores information about some units offered in SCEM. The text file contains three columns separated by ',', corresponding to the unit ID, the unit name and the course the unit belongs to. The first line of the text file contains header information and can be skipped during processing. You must download the text file from vUWS and upload it to the Jupyter Notebook server in the same manner you upload the notebook .ipynb files.

Write a program that reads the text file and saves all units from the MICT course in a separate text file.  A skeleton program is provided below and you need to complete the `read_write_file()` function.  

_Hint: you can use a for loop to read the lines from infile and check whether that line corresponds to a record for an MICT unit. Save that line to outfile if this is the case and discard the line otherwise. You can use `readline()` to skip the first line of infile._

Some suggested functions to look into to help you are:
- `open( file_to_open )`
- `readline()`
- `write( thing_to_write )`

An example of reading and writing files in Python can be found here: https://docs.python.org/3/tutorial/inputoutput.html

The basic flow of the `read_write_file` function would involve:
1. Open the `infile` for reading, and open the `outfile` for writing.
2. Discard the first line
3. Loop through the remaining lines
    - If 'MICT' in line
        - Write to `outfile`
    - Otherwise
        - Ignore the line
4. Close both `infile` and `outfile`.
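
The flow above can be sketched on a tiny made-up file. Everything in this example (the sample records, the file names and the temporary directory) is an assumption for illustration and is not the real scemunits.txt. The sketch uses a `with` statement, which closes both files automatically and so takes care of step 4.

```python
import os
import tempfile

# Made-up sample data in the same three-column layout as the unit file.
sample_lines = [
    "ID,Name,Course\n",
    "100001,Example Unit A,MICT\n",
    "100002,Example Unit B,BCS\n",
]

workdir = tempfile.mkdtemp()
in_path = os.path.join(workdir, "sample_units.txt")
out_path = os.path.join(workdir, "sample_mict.txt")

with open(in_path, "w") as f:
    f.writelines(sample_lines)

# Step 1: open one file for reading and the other for writing.
with open(in_path, "r") as infile, open(out_path, "w") as outfile:
    infile.readline()        # step 2: discard the header line
    for line in infile:      # step 3: loop through the remaining lines
        if "MICT" in line:
            outfile.write(line)

with open(out_path, "r") as f:
    kept = f.read()
print(kept)  # prints the single MICT record
```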

In [69]:
def read_write_file(infile, outfile):
    # Your code goes here to read the content of infile, pick up records of MICT units,
    # and save the results to outfile
    # Use the file names passed in as arguments instead of hard-coded names;
    # the with statement closes both files automatically.
    with open(infile, "r") as fin, open(outfile, "w") as fout:
        fin.readline()  # skip the header line
        for line in fin:
            unitid, unitname, course = line.strip().split(",")
            if course.strip() == "MICT":
                fout.write(line)

read_write_file("scemunits.txt", "mictunits.txt")

### Problem B2: Complex Dictionaries

This exercise is similar to the one above, except that the unit information is provided in a dictionary. You will write a function that accepts the units in a dictionary variable, together with a keyword string that holds the course name. The function should display all units that belong to the course matching the keyword.

The `units` variable defined below is a Python dictionary object. It is an object that maps a `key` to a `value`. For instance:
- `key1` -> `value1`
- `key2` -> `value2`
- `key3` -> `value3`

If you wanted to use `value1` you would use `key1` to access it. For example `units['key1']`.

A dictionary object has a method called `keys()` (which can be called as `units.keys()`). It lets you loop through each `key` stored in the dictionary, and each key may then be used to access its value. You should use this built-in `keys()` method to loop through the dictionary and test whether the value stored under each key is equal to the `keyword`.
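
As a small illustration of this pattern, the sketch below loops over the keys of a made-up dictionary and keeps those whose value matches a target. The dictionary `rooms` and its contents are assumptions for this example only.

```python
rooms = {"EB.1.48": "lab", "EB.2.10": "office", "EC.1.02": "lab"}  # made-up data

# Use each key to look up its value and compare it with the target.
for key in rooms.keys():
    if rooms[key] == "lab":
        print(key)

# The same test written as a list comprehension.
labs = [key for key in rooms.keys() if rooms[key] == "lab"]
print(labs)  # prints ['EB.1.48', 'EC.1.02']
```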

In [89]:
units = {
    ('301046','Big Data'): 'MICT',
    ('300581', 'Programming Techniques'): 'BICT',
    ('300144', 'OOA'): 'BICT',
    ('300103', 'Data Structures'): 'BCS',
    ('300147', 'OOP'):'BCS',
    ('300569', 'Computer Security'): 'BIS',
    ('301044', 'Data Science'): 'MICT',
    ('300582', 'TWA'): 'BICT'
}
for key in units.keys():
    if units[key]=="MICT":
        print(key)

('301046', 'Big Data')
('301044', 'Data Science')


In [101]:
def display_units(units, keyword):
    # Your code goes here to pick up all records from the list of units that belong
    # to the course specified by the keyword and display the result on screen
    for key in units.keys():
        if units[key]==keyword:
            unitid,unitname=key
            print(unitid,unitname)

units = {
    ('301046','Big Data'): 'MICT',
    ('300581', 'Programming Techniques'): 'BICT',
    ('300144', 'OOA'): 'BICT',
    ('300103', 'Data Structures'): 'BCS',
    ('300147', 'OOP'):'BCS',
    ('300569', 'Computer Security'): 'BIS',
    ('301044', 'Data Science'): 'MICT',
    ('300582', 'TWA'): 'BICT'
}
# the function below should display all MICT units
# 301046 Big Data
# 301044 Data Science
display_units(units, 'MICT')
# the function below should display all BCS units
# 300103 Data Structures
# 300147 OOP
display_units(units, 'BCS')

301046 Big Data
301044 Data Science
300103 Data Structures
300147 OOP


### Problem B3: Searching Files

Grep is an important utility program in the Linux operating system that shows all lines in a file that contain certain words or expressions. In this problem, you will write a simple grep program. You need to complete the `grep` function below, where `filename` is the name of the input file and `expr` is the expression to search for in each line of the input file. The `grep` function can then be used to find all occurrences of a given expression in a file. Note that the example test file _bigdata.txt_ is available on vUWS.

In [159]:
def grep(filename, expr): 
    # your code goes here to complete the grep functionality
    infile=open(filename, "r")
    infile.readline()
    for line in infile:
        if expr in line:
            print(line.strip())
    infile.close()
# display all lines from bigdata.txt containing Big data
grep("bigdata.txt", "Big data")
# display all lines from bigdata.txt containing technology
grep("bigdata.txt", "technology")

Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture,
curate, manage, and process data within a tolerable elapsed time. Big data "size" is a constantly moving
target, as of 2012 ranging from a few dozen terabytes to many petabytes of data. Big data is a set of
definition as follows: "Big data is high volume, high velocity, and/or high variety information assets that
Big data uses inductive statistics and concepts from nonlinear system identification to infer laws
Big data can also be defined as "Big data is a large volume unstructured data which cannot be handled by
Big data can be described by the following characteristics: Volume - The quantity of data that is generated is
Big data analytics consists of 6 Cs in the integrated industry 4.0 and Cyber Physical Systems environment. 6C
Big data requires exceptional technologies to efficiently process large quantities of data within tolerable
Bus wrapped with SAP Big data parked out

The basic workflow of this problem is much the same as the ones above. You will need to open the `filename` and discard the first line, and then loop through the rest. To test if the `expr` term is in the current line, use the `if <expr> in <line>` type syntax. If this is `True` then you want to `print()` the line.  
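
Two properties of the `in` test are worth keeping in mind: it is case-sensitive, and it matches any substring, not only whole words. The short sketch below shows both on a made-up line of text.

```python
line = "Big data requires exceptional technologies"  # made-up line

print("Big data" in line)   # prints True
print("big data" in line)   # prints False, because the match is case-sensitive
print("technolog" in line)  # prints True, because parts of words also match
```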

### Problem B4: File Word Count

For this exercise, you will complete a program that displays the top 10 most frequently occurring words in an input file. We will use the same test file _bigdata.txt_ as in the previous problem. A skeleton program is provided in the following cell, where the sorting and displaying functionality has already been implemented. You need to complete the `create_wordcount_dict` function, which creates a dictionary of key-value pairs for the input file whose name is passed to the function. Each entry in the dictionary has a key, which is a word that appeared in the input file, and a value, which is the number of occurrences of that word in the input file. Your function should return this dictionary. Note that returning anything other than a dictionary will cause the rest of the program to fail.

#### Hints
You can create an empty dictionary with either of the following syntax: `d = dict()` or `d = {}`.

Adding a new key-value pair to an existing dictionary uses the same syntax as updating an existing entry: `d['newKey'] = value`. However, if you wish to access an existing entry, you need to make sure the key already exists in the dictionary. For example, `d['newKey'] = d['newKey'] + 1` will fail with a `KeyError` if `newKey` is not already in the dictionary, because the right-hand side of the assignment attempts to _access_ a non-existent entry.

You can check if a key already exists in a dictionary by using the `in` operator. e.g.

```python
d = {'building': 'EB', 'floor': '1', 'room': 48}
if 'floor' in d:
    print("The floor is", d['floor'])
```
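
Putting these hints together, a counting loop can be sketched in two equivalent ways: with an explicit `in` check, or with `dict.get(key, default)`, which returns the default when the key is missing. The list `words` is made up for this example.

```python
words = ["big", "data", "big"]  # made-up word list

# Version 1: check whether the key exists before accessing it.
counts = {}
for word in words:
    if word in counts:
        counts[word] = counts[word] + 1
    else:
        counts[word] = 1
print(counts)  # prints {'big': 2, 'data': 1}

# Version 2: get() supplies 0 for words that have not been seen yet.
counts_with_get = {}
for word in words:
    counts_with_get[word] = counts_with_get.get(word, 0) + 1
print(counts_with_get == counts)  # prints True
```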



In [191]:
import string
def create_wordcount_dict(filename):
    # Your code goes here to create and return a dictionary
    with open(filename,"r") as infile:
        infile.readline()
        d=dict()
        for line in infile:
            for word in line.strip().split():
                word = word.strip(string.punctuation).lower()
                d[word]=d.get(word,0)+1
    return d

wordcount = create_wordcount_dict("bigdata.txt")

# sort the entries in the dictionary by their values in descending order i.e. value in the key-value pair
# note that the return value is a list of keys only which is assigned to sorted_keys_list 
sorted_keys_list = sorted(wordcount, key=wordcount.get, reverse=True)

# print the top 10 entries in the sorted list
for key in sorted_keys_list[:10]:
    print(key, wordcount[key])

the 359
data 236
of 226
and 219
to 180
in 144
big 107
a 102
is 83
as 71
