# Exercise 4

## Problem 1

After parsing 'q2_dataset.txt' and 'q4_dataset.txt', I noticed that the data in the latter is cleaner. First, venue names that differ only by year are aggregated into one single venue. This makes the total number of distinct venues smaller, which is more accurate. Also, each publication has one additional line starting with '#!', which I infer to be a comment or notice.

## Problem 2

### Part a

In [1]:
import os
import pyspark
from pyspark import SparkContext
from pyspark.mllib.fpm import FPGrowth
import re
SparkContext.setSystemProperty('spark.executor.memory','3g')
conf=pyspark.SparkConf().setAppName('test').setMaster('local[1]')
sc=pyspark.SparkContext(conf=conf)
rdd=sc.textFile('q4_dataset.txt')

In [2]:
rdd_new=rdd.filter(lambda x: re.match(r'^#@(.*)', x)).map(lambda x: re.match(r'^#@(.*)', x).group(1))

In [3]:
rdd_res = rdd_new.map(lambda line: line.strip().split(',')).map(lambda x: list(set(x)))
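
To make the two cells above concrete, here is a minimal pure-Python sketch of the same parsing applied to a few made-up lines. The sample lines and the helper name `to_transaction` are illustrative assumptions and not part of the dataset or of Spark; only the `#@` prefix and the comma-separated author list come from the code above.

```python
import re

AUTHOR_LINE = re.compile(r'^#@(.*)')

def to_transaction(line):
    """Return the de-duplicated author list for a '#@' line, or None otherwise."""
    m = AUTHOR_LINE.match(line)
    if m is None:
        return None
    # FPGrowth rejects transactions with repeated items, hence the set();
    # sorted() is only there to make the printed output deterministic.
    return sorted(set(m.group(1).strip().split(',')))

# Made-up sample lines (hypothetical authors and venue).
sample_lines = [
    '#@Alice A.,Bob B.,Alice A.',
    '#cSomeVenue',
    '#!a made-up comment line',
    '#@Carol C.',
]
transactions = [t for t in map(to_transaction, sample_lines) if t is not None]
print(transactions)  # [['Alice A.', 'Bob B.'], ['Carol C.']]
```

Compiling the pattern once and matching each line a single time also avoids running the same regex twice per line, as the `filter` plus `map` version does.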

In [4]:
model1 = FPGrowth.train(rdd_res, 1e-4, 2)

In [5]:
model1.freqItemsets().collect()

[FreqItemset(items=['Vijay Kumar'], freq=349),
 FreqItemset(items=['David Maier'], freq=227),
 FreqItemset(items=['Ning Zhong'], freq=303),
 FreqItemset(items=['Umeshwar Dayal'], freq=225),
 FreqItemset(items=['Henri Prade'], freq=391),
 FreqItemset(items=['Daniel A. Keim'], freq=247),
 FreqItemset(items=['Albert Y. Zomaya'], freq=309),
 FreqItemset(items=['Dennis Sylvester'], freq=232),
 FreqItemset(items=['Hao Zhang'], freq=315),
 FreqItemset(items=['Xin Liu'], freq=290),
 FreqItemset(items=['Diane Crawford'], freq=250),
 FreqItemset(items=['Oscar C. Au'], freq=236),
 FreqItemset(items=['Herbert Edelsbrunner'], freq=239),
 FreqItemset(items=['Yue Wang'], freq=327),
 FreqItemset(items=['K. J. Ray Liu'], freq=374),
 FreqItemset(items=['Wayne Wolf'], freq=243),
 FreqItemset(items=['Haizhou Li'], freq=230),
 FreqItemset(items=['Leonidas J. Guibas'], freq=363),
 FreqItemset(items=['Joel H. Saltz'], freq=253),
 FreqItemset(items=['José Meseguer'], freq=295),
 FreqItemset(items=['Qiang Wang

In [6]:
model2 = FPGrowth.train(rdd_res, 1e-5, 2)
model2.freqItemsets().collect()

[FreqItemset(items=['Philip Levis'], freq=58),
 FreqItemset(items=['Eva Hudlicka'], freq=26),
 FreqItemset(items=['Pierre Vandergheynst'], freq=109),
 FreqItemset(items=['Pierre Vandergheynst', 'Pascal Frossard'], freq=23),
 FreqItemset(items=['Mark Minas'], freq=81),
 FreqItemset(items=['Martti Penttonen'], freq=32),
 FreqItemset(items=['Yen-Hsien Lee'], freq=23),
 FreqItemset(items=['Terrence S. T. Mak'], freq=30),
 FreqItemset(items=['Andrew W. Appel'], freq=85),
 FreqItemset(items=['John MacIntyre'], freq=28),
 FreqItemset(items=['Pierre Soille'], freq=57),
 FreqItemset(items=['Ana Cristina Vieira de Melo'], freq=26),
 FreqItemset(items=['Jan Stallaert'], freq=25),
 FreqItemset(items=['Tomonari Masada'], freq=26),
 FreqItemset(items=['Roberto Sabella'], freq=23),
 FreqItemset(items=['Ricardo C. Farias'], freq=30),
 FreqItemset(items=['Arnoud Visser'], freq=22),
 FreqItemset(items=['Ling Zhao'], freq=26),
 FreqItemset(items=['Darrell D. E. Long'], freq=120),
 FreqItemset(items=['Pee

In [7]:
model3 = FPGrowth.train(rdd_res, 0.5e-5, 2)
res = model3.freqItemsets().collect()

In [8]:
res

[FreqItemset(items=['Eva Hudlicka'], freq=26),
 FreqItemset(items=['Philip Levis'], freq=58),
 FreqItemset(items=['Philip Levis', 'David E. Culler'], freq=18),
 FreqItemset(items=['Yin Zhao'], freq=16),
 FreqItemset(items=['Pierre Vandergheynst'], freq=109),
 FreqItemset(items=['Pierre Vandergheynst', 'Pascal Frossard'], freq=23),
 FreqItemset(items=['Pierre Vandergheynst', 'Jean-Philippe Thiran'], freq=16),
 FreqItemset(items=['Mark Minas'], freq=81),
 FreqItemset(items=['Martti Penttonen'], freq=32),
 FreqItemset(items=['Martti Penttonen', 'Ville Leppänen'], freq=11),
 FreqItemset(items=['Yen-Hsien Lee'], freq=23),
 FreqItemset(items=['Yen-Hsien Lee', 'Chih-Ping Wei'], freq=14),
 FreqItemset(items=['Terrence S. T. Mak'], freq=30),
 FreqItemset(items=['Andrew W. Appel'], freq=85),
 FreqItemset(items=['Even-André Karlsson'], freq=15),
 FreqItemset(items=['John MacIntyre'], freq=28),
 FreqItemset(items=['Pierre Soille'], freq=57),
 FreqItemset(items=['Ana Cristina Vieira de Melo'], freq

When I tried to train a model with a support threshold of 1e-6, the program crashed with an out-of-memory error. Across the three thresholds above, the number of frequent itemsets increases as the threshold decreases, and the lowest frequency among the reported itemsets drops. I therefore think the crash happens because a very low threshold generates a huge number of frequent itemsets, which costs a lot of memory and system resources.
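
The effect of the threshold is easier to see as an absolute count. As far as I understand Spark's FPGrowth, the relative `minSupport` is turned into a minimum count of `ceil(minSupport * N)`, where `N` is the number of transactions. The sketch below works this out for the four thresholds; the value of `N` is a hypothetical round number chosen for illustration and is not measured from the dataset, and `min_count` is a helper name I made up. Exact fractions are used only to keep the arithmetic free of floating-point noise.

```python
import math
from fractions import Fraction

def min_count(min_support, num_transactions):
    """Smallest absolute frequency an itemset needs in order to be reported."""
    # Fraction parses strings such as '1e-5' exactly.
    return math.ceil(Fraction(min_support) * num_transactions)

N = 2_000_000  # hypothetical number of '#@' lines, NOT taken from the dataset

for s in ['1e-4', '1e-5', '0.5e-5', '1e-6']:
    print(s, min_count(s, N))
# 1e-4 200
# 1e-5 20
# 0.5e-5 10
# 1e-6 2
```

Under this assumption a threshold of 1e-6 would admit any group of authors that appears together on just two papers. Since every non-empty subset of a frequent itemset is itself frequent, a single group of k such authors already contributes 2^k - 1 itemsets, which fits the memory blow-up described above.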

### Part b

To find co-author relationships, I saved the frequent itemsets in a list called `res` and used the function `findCoAuthor` to filter for the target author and sort the result by frequency.

In [9]:
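def takeSecond(tup):
    # A FreqItemset is an (items, freq) tuple, so index 1 is the frequency.
    return tup[1]
def findCoAuthor(fis, name):
    # Keep every frequent itemset that contains `name` (this includes the
    # single-author itemset) and return the five most frequent ones.
    l = []
    for fi in fis:
        if name in fi[0]:
            l.append(fi)
    l.sort(key=takeSecond, reverse=True)
    return l[:5]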
ra = findCoAuthor(res, 'Rakesh Agrawal')
jh = findCoAuthor(res, 'Jiawei Han')
zg = findCoAuthor(res, 'Zoubin Ghahramani')
cf = findCoAuthor(res, 'Christos Faloutsos')

In [10]:
ra

[FreqItemset(items=['Rakesh Agrawal'], freq=199),
 FreqItemset(items=['Ramakrishnan Srikant', 'Rakesh Agrawal'], freq=33),
 FreqItemset(items=['Jerry Kiernan', 'Rakesh Agrawal'], freq=16),
 FreqItemset(items=['Rakesh Agrawal', 'H. V. Jagadish'], freq=15),
 FreqItemset(items=['Yirong Xu', 'Rakesh Agrawal'], freq=11)]

The top-5 co-authors for Rakesh Agrawal are Ramakrishnan Srikant, Jerry Kiernan, H. V. Jagadish, Roberto J. Bayardo Jr. and Yirong Xu. 

In [11]:
jh

[FreqItemset(items=['Jiawei Han'], freq=581),
 FreqItemset(items=['Xifeng Yan', 'Jiawei Han'], freq=55),
 FreqItemset(items=['Jiawei Han', 'Philip S. Yu'], freq=51),
 FreqItemset(items=['Jian Pei', 'Jiawei Han'], freq=37),
 FreqItemset(items=['Yizhou Sun', 'Jiawei Han'], freq=32)]

The top-5 co-authors of Jiawei Han are Xifeng Yan, Philip S. Yu, Jian Pei, Yizhou Sun, Xin Jin. 

In [12]:
zg

[FreqItemset(items=['Zoubin Ghahramani'], freq=157),
 FreqItemset(items=['David L. Wild', 'Zoubin Ghahramani'], freq=15),
 FreqItemset(items=['Katherine A. Heller', 'Zoubin Ghahramani'], freq=13),
 FreqItemset(items=['Zoubin Ghahramani', 'Michael I. Jordan'], freq=11)]

Only three co-authors of Zoubin Ghahramani appear in the frequent itemsets at this threshold: David L. Wild, Katherine A. Heller and Michael I. Jordan.

In [13]:
cf

[FreqItemset(items=['Christos Faloutsos'], freq=374),
 FreqItemset(items=['Hanghang Tong', 'Christos Faloutsos'], freq=27),
 FreqItemset(items=['Spiros Papadimitriou', 'Christos Faloutsos'], freq=26),
 FreqItemset(items=['Jimeng Sun', 'Christos Faloutsos'], freq=24),
 FreqItemset(items=['Agma J. M. Traina', 'Christos Faloutsos'], freq=24)]

The top-5 co-authors of Christos Faloutsos are Hanghang Tong, Spiros Papadimitriou, Jimeng Sun, Agma J. M. Traina, Caetano Traina Jr..
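
Each output above starts with the author's own single-item itemset, so `l[:5]` shows at most four co-authors. A minimal sketch of a variant that keeps only the two-item itemsets and returns the co-author names directly is given below. The function name `top_coauthors` is hypothetical, and the `namedtuple` is a stand-in for pyspark's `FreqItemset` (which has the same `items` and `freq` fields) so that the example runs without Spark. The sample data is copied from the `ra` output above.

```python
from collections import namedtuple

# Stand-in for pyspark.mllib.fpm.FPGrowth.FreqItemset, so no Spark is needed.
FreqItemset = namedtuple('FreqItemset', ['items', 'freq'])

def top_coauthors(fis, name, k=5):
    """Return up to k (co-author, freq) pairs from 2-item itemsets containing name."""
    pairs = []
    for fi in fis:
        if len(fi.items) == 2 and name in fi.items:
            # The co-author is whichever of the two items is not `name`.
            other = fi.items[0] if fi.items[1] == name else fi.items[1]
            pairs.append((other, fi.freq))
    pairs.sort(key=lambda p: p[1], reverse=True)
    return pairs[:k]

sample = [
    FreqItemset(['Rakesh Agrawal'], 199),
    FreqItemset(['Ramakrishnan Srikant', 'Rakesh Agrawal'], 33),
    FreqItemset(['Jerry Kiernan', 'Rakesh Agrawal'], 16),
    FreqItemset(['Rakesh Agrawal', 'H. V. Jagadish'], 15),
    FreqItemset(['Yirong Xu', 'Rakesh Agrawal'], 11),
]
print(top_coauthors(sample, 'Rakesh Agrawal'))
# [('Ramakrishnan Srikant', 33), ('Jerry Kiernan', 16), ('H. V. Jagadish', 15), ('Yirong Xu', 11)]
```

Restricting to itemsets of size two also keeps larger itemsets (three or more authors) from taking up slots in the top-k list.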

## Problem 3

### Part a

For Problem 3, we first have to construct a list of venues for each author. Using the same method as in Exercise 2, I import `pandas` and do some wrangling with regex and `pandas` functions. I then call `FPGrowth.train()` to find the frequent itemsets for the target support threshold and `collect()` to bring the results back to the driver.
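
To show the shape of the data that FPGrowth needs here, the following is a minimal pure-Python sketch of the author-to-venues grouping on made-up lines. It assumes that every `#@` line is followed by the `#c` line of the same paper; the function name `venues_by_author` and the sample authors are hypothetical. It is an illustration of the idea, not a replacement for the `pandas` cells.

```python
from collections import defaultdict

def venues_by_author(lines):
    """Map each author to the set of venues they published in."""
    result = defaultdict(set)
    authors = []
    for line in lines:
        if line.startswith('#@'):
            # Skip empty names, e.g. from a paper with no listed authors.
            authors = [a for a in line[2:].strip().split(',') if a]
        elif line.startswith('#c'):
            venue = line[2:].strip()
            for a in authors:
                result[a].add(venue)
            authors = []  # reset so a stray '#c' line is not attributed twice
    return dict(result)

# Made-up sample: the authors are hypothetical.
sample_lines = [
    '#@Alice A.,Bob B.',
    '#cKDD',
    '#@Alice A.',
    '#cVLDB',
    '#@Bob B.',
    '#cKDD',
]
baskets = venues_by_author(sample_lines)
print({a: sorted(v) for a, v in baskets.items()})
# {'Alice A.': ['KDD', 'VLDB'], 'Bob B.': ['KDD']}
```

Each author then forms one transaction whose items are venues. Keeping the venues in a set rather than a comma-joined string avoids splitting a venue whose name itself contains a comma.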

In [14]:
import pandas as pd
df = pd.read_csv('q4_dataset.txt', delimiter='\t')
df.columns = ['col']
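# No capture group in the patterns: str.contains only tests for a match,
# and a pattern with groups makes pandas emit a UserWarning.
authors=df[df['col'].str.contains(r'^#@')]
authors=authors['col'].str.replace('#@', '')
venues=df[df['col'].str.contains(r'^#c')]
venues=venues['col'].str.replace('#c', '')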

In [15]:
authors=authors.reset_index(drop=True)
venues=venues.reset_index(drop=True)

In [16]:
df_new = pd.concat([authors, venues], axis=1)

In [17]:
df_new.columns = ['authors', 'venue']

In [18]:
df_new = df_new.drop('authors', axis=1).join(df_new['authors'].str.split(',', expand=True).stack().reset_index(level=1, drop=True).rename('author'))

In [19]:
author_agg=df_new.groupby('author').agg(','.join).reset_index()

In [20]:
author_agg

| | author | venue |
|---|---|---|
| 0 | | The INGRES Papers,The INGRES Papers,The TSQL2 ... |
| 1 | Abhay Harpale | KDD |
| 2 | Ai Wen | CSWS |
| 3 | Aihua Bao | CSWS |
| 4 | Bin Han | CSWS |
| 5 | Bo Andersson | CSWS |
| 6 | Bo Wang | CSWS |
| 7 | Chen Ting Zhao | CSWS |
| 8 | Christos Faloutsos | KDD |
| 9 | Chun E Ma | CSWS |


In [21]:
# Drop the first row, whose author name is empty.
author_agg=author_agg.drop(author_agg.head(1).index)

In [22]:
author_agg['venue'].to_csv('author_agg.csv', sep='\n', index=False)

In [23]:
author_venues=sc.textFile('author_agg.csv')

In [24]:
venue_train = author_venues.map(lambda line: line.strip().split(',')).map(lambda x: list(set(x)))

In [25]:
venue1 = FPGrowth.train(venue_train, 1e-3, 2)
venue_res1 = venue1.freqItemsets().collect()

In [26]:
venue_res1

[FreqItemset(items=['IEEE Transactions on Information Technology in Biomedicine'], freq=3930),
 FreqItemset(items=['RoboCup'], freq=1651),
 FreqItemset(items=['Knowl.-Based Syst.'], freq=3027),
 FreqItemset(items=['PaCT'], freq=1588),
 FreqItemset(items=['ISSCC'], freq=4950),
 FreqItemset(items=['KES (3)'], freq=1913),
 FreqItemset(items=['Multimedia Tools Appl.'], freq=3174),
 FreqItemset(items=['EUROMICRO'], freq=1734),
 FreqItemset(items=['Biological Cybernetics'], freq=3453),
 FreqItemset(items=['ACL'], freq=2800),
 FreqItemset(items=['SOFSEM'], freq=1331),
 FreqItemset(items=['IEEE SCC'], freq=1970),
 FreqItemset(items=['J. Comb. Optim.'], freq=1303),
 FreqItemset(items=['Int. J. Math. Mathematical Sciences'], freq=1775),
 FreqItemset(items=['OTM Workshops'], freq=1814),
 FreqItemset(items=['SEKE'], freq=3650),
 FreqItemset(items=['GECCO'], freq=4545),
 FreqItemset(items=['International Conference on Internet Computing'], freq=1857),
 FreqItemset(items=['VTS'], freq=1702),
 FreqIt

In [None]:
venue2 = FPGrowth.train(venue_train, 0.4e-3, 2)
venue_res2 = venue2.freqItemsets().collect()

In [None]:
venue_res2

The trend is the same as in Problem 2: as the support threshold gets smaller, the set of frequent itemsets becomes bigger and bigger, and itemsets with lower frequency are included. I also hit the same out-of-memory error when I tried to train the model with the smallest support threshold.

### Part b

In [None]:
def findVenue(fis, name):
    l = []
    for fi in fis:
        if name in fi[0]:
            l.append(fi)
    l.sort(key=takeSecond, reverse=True)
    return l[:15]
nips = findVenue(venue_res2, 'NIPS')
kdd = findVenue(venue_res2, 'KDD')
vldb = findVenue(venue_res2, 'VLDB')
infocom = findVenue(venue_res2, 'INFOCOM')
acl = findVenue(venue_res2, 'ACL')

In [None]:
nips

Within the area of machine learning, based on NIPS, the top 10 venues that authors also publish in are CoRR, ICML, Neural Computation, Journal of Machine Learning Research - Proceedings Track, Journal of Machine Learning Research, IEEE Trans. Pattern Anal. Mach. Intell., CVPR, Neurocomputing and Neural Networks. 

In [None]:
kdd

Within the area of data mining, based on KDD, the top 10 venues that authors also publish in are CoRR, ICDM, CIKM, IEEE Trans. Knowl. Data Eng., SDM, ICML, WWW. 

In [None]:
vldb

Within the area of databases, based on VLDB, the top 10 venues that authors also publish in are ICDE, SIGMOD Conference, CoRR, IEEE Trans. Knowl. Data Eng., SIGMOD Record, EDBT, CIKM, IEEE Data Eng. Bull., VLDB J., ACM Trans. Database Syst..

In [None]:
infocom

Within the area of computer networks, based on INFOCOM, the top 10 venues that authors also publish in are GLOBECOM, ICC, CoRR, IEEE/ACM Trans. Netw., IEEE Journal on Selected Areas in Communications, Computer Networks, Computer Communications, ICDCS, IEEE Trans. Parallel Distrib. Syst..

In [None]:
acl

Within the area of natural language processing, based on ACL, the top 10 venues that authors also publish in are COLING, LREC, CoRR, EMNLP, HLT-NAACL, INTERSPEECH. 