### MY470 Computer Programming

### Week 3 Assignment, MT 2021

#### \*\*\* Due 12:00 noon on Monday, October 18 \*\*\*

---
### Working with data files

For this assignment, we will use data from the file ca-GrQc.txt. The file contains the co-authorship links for articles in the arXiv category General Relativity and Quantum Cosmology (GR-QC). Each line in the file includes the ids of two authors who have worked together on at least one article. In network analysis parlance, this is known as an "edge list". The data were obtained from the [Stanford Large Network Dataset Collection](https://snap.stanford.edu/data/index.html) and you can find more information about them at https://snap.stanford.edu/data/ca-GrQc.html.

#### Hints

The problems below need to be done in sequence because objects (lists, dictionaries, etc.) you create in early problems may be needed for a later problem. However, if you don't manage to obtain these objects at the beginning, just use fictitious ones, e.g. `[13, 14, 22, 24, 25, 26, 27, 28, 29, 45]` or `{13: [13, 7596, 11196, 19170], 14: [14171]}`. 

### Problem 1: Get all coauthorships in a list of lists

Create a list that contains all edges included in the data file as lists of the two authors' ids, where the ids are saved as integers. Your list should look like `[[3466, 937], [3466, 5233], ...]`. To achieve this, use a `for` loop to iterate over each line in the file. One way to do this is as follows:

```
for line in open('ca-GrQc.txt', 'r'):
    pass  # do something with line
```

⚡️ Notice that this is more memory-efficient than `file.read()`, which we used in Assignment 2: instead of loading the whole file into memory at once, you stream it line by line.
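
A minimal sketch of the difference between the two approaches, assuming a small temporary stand-in file rather than the real dataset (the file contents and the names `whole_text` and `line_count` are made up for illustration):

```python
import os
import tempfile

# Write a tiny stand-in file; the contents are invented for this sketch.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("1\t2\n2\t1\n2\t3\n3\t2\n")
    path = tmp.name

# file.read() loads the entire file into a single string.
with open(path, "r") as f:
    whole_text = f.read()

# Iterating over the file object yields one line at a time,
# so only the current line has to be held in memory.
line_count = 0
with open(path, "r") as f:
    for line in f:
        line_count += 1

os.remove(path)
print(len(whole_text), line_count)
```

For a file this small the two approaches are equivalent; the streaming version only pays off when the file is large.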

Print the first 10 entries in your list. 

#### Hints

It is a good practice to write and test your initial code using a smaller version of the dataset. This will help you debug faster. It will also allow you to manually check for possible errors. 

You need to ignore the first four lines of the file that contain explanatory text.

In the file, the two author ids are separated with tabs and the tab character is encoded as `'\t'`.
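
The hints above can be tried out on a miniature version of the data. The sketch below is one possible approach, not the required one: it assumes a made-up sample held in memory with `io.StringIO` (which behaves like an open text file) and skips the header with `itertools.islice`. The names `sample_text` and `sample_edges` are hypothetical.

```python
import io
from itertools import islice

# Made-up miniature of the file: four explanatory lines, then tab-separated ids.
sample_text = (
    "# explanatory line 1\n"
    "# explanatory line 2\n"
    "# explanatory line 3\n"
    "# explanatory line 4\n"
    "3466\t937\n"
    "3466\t5233\n"
    "937\t3466\n"
)

sample_edges = []
with io.StringIO(sample_text) as sample_file:
    # islice(..., 4, None) skips the first four lines and yields the rest.
    for line in islice(sample_file, 4, None):
        first, second = line.strip().split("\t")
        sample_edges.append([int(first), int(second)])

print(sample_edges)  # [[3466, 937], [3466, 5233], [937, 3466]]
```

Because the sample is small, the result can be checked by eye before running the same loop on the full file.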


In [1]:
# Enter your answer to Problem 1 here. 
new_list = []

# Stream the file line by line. Data lines start with a digit,
# so this test also skips the explanatory lines at the top.
with open("ca-GrQc.txt", "r") as f:
    for line in f:
        if line[0].isdigit():
            ids = line.strip().split('\t')
            new_list.append([int(x) for x in ids])

print(new_list[:10])


[[3466, 937], [3466, 5233], [3466, 8579], [3466, 10310], [3466, 15931], [3466, 17038], [3466, 18720], [3466, 19607], [10310, 1854], [10310, 3466]]


### Problem 2: Who are the authors in the data?

Create a sorted list with the integer ids for all of the unique authors in the dataset. Print the first 10 authors in the list. Then print how many authors there are in total.

Then, using a dictionary comprehension, create a dictionary in which the keys are the author integer ids and the values are empty lists. The dictionary should look something like: `{13: [], 14: [], 22: [], ...}`. To confirm, print the dictionary values for the authors in the list `[13, 14, 22, 24, 25, 26, 27, 28, 29, 45]`.

#### Hints

Note that if the edge *i–j* is in the data, then the edge *j–i* is also there. This means that for this task you don't need to consider the second author in the line. You can get all authors by collecting just the first author in each line in the file.
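
To see why the first column is enough, here is a small sketch on a made-up symmetric edge list (the names `toy_edges`, `toy_authors`, and `toy_coauthors` are hypothetical and the ids are invented):

```python
# Made-up symmetric edge list: every pair appears in both directions.
toy_edges = [[1, 2], [2, 1], [2, 3], [3, 2], [5, 5]]

first_authors = {edge[0] for edge in toy_edges}
all_authors = {author for edge in toy_edges for author in edge}

# Because the list is symmetric, the first column already covers everyone.
print(first_authors == all_authors)  # True

# sorted() turns the set into an ascending list of unique ids.
toy_authors = sorted(first_authors)
print(toy_authors)  # [1, 2, 3, 5]

# Dictionary comprehension: one empty list per author.
toy_coauthors = {author: [] for author in toy_authors}
print(toy_coauthors)  # {1: [], 2: [], 3: [], 5: []}
```

Note that a set has no guaranteed order, so converting it with `list()` alone would not necessarily give a sorted list.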

In [2]:
# Enter your answer to Problem 2 here. 

# Every edge appears in both directions, so the first column contains all authors.
first_authors = [edge[0] for edge in new_list]

# A set removes duplicates; sorted() returns the ids in ascending order.
unique_authors = sorted(set(first_authors))
print(unique_authors[:10])
print('Number of unique authors :', len(unique_authors))

authors_dic = {author: [] for author in unique_authors}
for i in [13, 14, 22, 24, 25, 26, 27, 28, 29, 45]:
    print(i, ':', authors_dic[i])

[13, 14, 22, 24, 25, 26, 27, 28, 29, 45]
Number of unique authors : 5242
13 : []
14 : []
22 : []
24 : []
25 : []
26 : []
27 : []
28 : []
29 : []
45 : []


---
### Problem 3: Get each author's coauthors

Enter each author's unique coauthors in the empty dictionary you created in Problem 2. The dictionary should now look something like: `{13: [7596, 11196, 19170], 14: [14171], ...}`.

Print the list of coauthors for the authors in the list `[13, 14, 22, 24, 25, 26, 27, 28, 29, 45]`.

#### Hints

Notice that the data contain errors. For example, I noticed that the data say that author 13 coauthored with himself/herself, which is meaningless. To get the maximum number of points, your code should exclude oneself in one's list of coauthors.
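
A minimal sketch of this filtering on a made-up edge list, which contains one self-loop and one repeated edge purely for illustration (the names `toy_edges` and `toy_coauthors` are hypothetical):

```python
# Made-up edges: [1, 1] is a self-loop and [2, 3] appears twice.
toy_edges = [[1, 1], [1, 2], [2, 1], [2, 3], [3, 2], [2, 3]]
toy_coauthors = {1: [], 2: [], 3: []}

for author, coauthor in toy_edges:
    if author == coauthor:
        continue  # an author cannot be their own coauthor
    if coauthor not in toy_coauthors[author]:
        toy_coauthors[author].append(coauthor)  # keep each coauthor once

print(toy_coauthors)  # {1: [2], 2: [1, 3], 3: [2]}
```

The membership test on a list is linear in the list's length; that is acceptable for short coauthor lists, and a set per author would be an alternative if order did not matter.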

In [3]:
# Enter your answer to Problem 3 here. 

for author, coauthor in new_list:
    # Skip self-loops and coauthors that are already recorded.
    if author != coauthor and coauthor not in authors_dic[author]:
        authors_dic[author].append(coauthor)

for i in [13, 14, 22, 24, 25, 26, 27, 28, 29, 45]:
    print(i, ':', authors_dic[i])

13 : [7596, 11196, 19170]
14 : [14171]
22 : [106, 11183, 15793, 19440, 22618, 25043]
24 : [3858, 15774, 19517, 23161]
25 : [22891]
26 : [1407, 4550, 11801, 13096, 13142]
27 : [11114, 19081, 24726, 25540]
28 : [7916]
29 : [20243]
45 : [570, 773, 1186, 1653, 2212, 2741, 2952, 3372, 4164, 4180, 4511, 4513, 6179, 6610, 6830, 7956, 8879, 9785, 11241, 11472, 12365, 12496, 12679, 12781, 12851, 14540, 14807, 15003, 15659, 17655, 17692, 18719, 18866, 18894, 19423, 19961, 20108, 20562, 20635, 21012, 21281, 21508, 21847, 22691, 22887, 23293, 24955, 25346, 25758]


---
### Problem 4: Who has the most coauthors?

Find the author who has the most coauthors. Print the id of that author and the number of coauthors they have. 

Solve this problem using iteration and conditionals; you are not allowed to use external modules. 
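
One way to do this with only a loop and a conditional is to keep a running maximum. The sketch below uses a made-up dictionary; the names `toy_coauthors`, `best_author`, and `best_count` are hypothetical.

```python
# Made-up coauthor dictionary for illustration.
toy_coauthors = {1: [2], 2: [1, 3], 3: [2]}

best_author = None
best_count = -1

for author, coauthors in toy_coauthors.items():
    # The strict comparison keeps the first author encountered in case of a tie.
    if len(coauthors) > best_count:
        best_author = author
        best_count = len(coauthors)

print(best_author, best_count)  # 2 2
```

Starting `best_count` at -1 ensures that even an author with an empty list can become the current best on the first iteration.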


In [5]:
# Enter your answer to Problem 4 here. 

max_author = None
max_length = -1

# Keep a running maximum while iterating over every author.
for author, coauthors in authors_dic.items():
    if len(coauthors) > max_length:
        max_length = len(coauthors)
        max_author = author

print('Author ID with maximum co-authors :', max_author)
print('Number of co-authors :', max_length)

Author ID with maximum co-authors : 21012
Number of co-authors : 81


---

### Evaluation

| Problem         | Mark    | Comment |
|:---------------:|:-------:|:--------|
| 1               | /3      |         |
| 2               | /4      |         |
| 3               | /3      |         |
| 4               | /4      |         |
| Code legibility | /2      |         |
| Code efficiency | /4      |         |
| **Total**       | **/20** |         |
