# Topic Modeling of Publications in Grenoble - France
#### Name: Claudia Avila
The municipality of Grenoble is launching the GRENOBLE4RESEARCH initiative, aimed at fostering the
local network of research actors. You should help the initiative, by producing a short report about the
research activity in the city, by discovering the degree of collaboration among local actors and by
exploring which fields or areas of research have been emerging in the last 10 years.

### Brief Report
The main objective of this exercise was to explore research activity in Grenoble over the last 10 years (2009-2018). For this purpose, 5886 publications from Grenoble were retrieved from the HAL open repository and analyzed through several steps of data cleaning and processing, followed by text tokenization, CountVectorizer, an LDA model and GridSearchCV.

The number of publications in Grenoble increased over the years, although with marked fluctuations roughly every two years, and reached a peak in 2018 (973 publications). Since one of the purposes was to analyze local actors for the Grenoble4Research initiative, we observed that the governmental research organization Centre National de la Recherche Scientifique (CNRS) was the institution with the highest number of publications in Grenoble (3837). This institution appears under several structure types (department, laboratory, research team), but most of its publications are recorded under the type 'institution'.

The analysis of research fields showed that 'Human Sciences and Society' was the main domain in Grenoble during the decade under analysis, with 2220 publications, while 'Nonlinear Sciences' was at the bottom of the overall production with only 3 publications. In terms of the specialization index, however, 'Nonlinear Sciences' in Grenoble played an important role relative to France as a whole. According to this index, 7 of the 12 domains reviewed in Grenoble showed a comparative advantage with respect to the whole production in France.

According to the collaboration network analysis, most publications were produced in collaboration (97%): 5,665 involved more than one institution, while only 167 (3%) were produced by a single institution. This result indicates a high level of networking among the research institutions publishing in Grenoble. Notably, the CNRS stands out as the leading actor in research relationships in Grenoble, since it produced most of its publications (3264 of 3837) in co-authorship with other institutions. Finally, the analysis of abstracts allowed us to classify publications into ten topics, of which the most common were related to data analysis models (topic 7) and social and political research (topic 4).

## task 0: data extraction
You will download the data from the HAL repository. HAL is an open archive, where researchers can
deposit scholarly documents from all academic fields. By using the HAL open API, you should first obtain all
publications produced within the municipality of Grenoble, from 2009 to 2018. HAL provides a fairly
extensive documentation, which can help you build the queries to retrieve the data of interest.

First, it was necessary to install the requests and pipenv packages in order to download the dataset from the HAL open API. The query was filtered to retrieve the publications produced in the city of Grenoble from 2009 to 2018. The main fields processed for this test were:

    docid: id of the publication
    city_s: city of the publication
    submittedDateY_i: year in which the publication was submitted
    structName_s: names of the institutions that produced the publication
    structType_s: type of institution
    docType_s: type of publication
    en_abstract_s: English abstract of the publication
    domain_s: domain of the publication
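
The query described above can also be assembled from a parameter dictionary, which keeps the filters readable and leaves URL encoding to the library. The following is a minimal sketch: the helper name `build_hal_params` is hypothetical, the endpoint, filters and field names are copied from the request used in this notebook, and the `cursorMark` and `auth` arguments are left out for brevity.

```python
from urllib.parse import urlencode

# Endpoint used in this notebook.
HAL_SEARCH_URL = "http://api.archives-ouvertes.fr/search/"

# Fields requested in this notebook.
FIELDS = ["docid", "city_s", "submittedDateY_i", "structName_s",
          "structType_s", "docType_s", "en_abstract_s", "domain_s"]


def build_hal_params(city, first_year, last_year, rows=10000):
    """Build the query parameters (hypothetical helper mirroring the URL above)."""
    return {
        "q": "*:*",
        "wt": "csv",
        # One entry per filter query; each becomes its own 'fq=' parameter.
        "fq": [f"submittedDateY_i:[{first_year} TO {last_year}]",
               f"city_s:{city}"],
        "fl": ",".join(FIELDS),
        "sort": "docid asc",
        "rows": rows,
    }


params = build_hal_params("Grenoble", 2009, 2018)
# doseq=True repeats the key for list values (fq=...&fq=...).
query_string = urlencode(params, doseq=True)
print(HAL_SEARCH_URL + "?" + query_string)
```

The same dictionary can be passed as `params=params` to `requests.get`, which encodes list values in the same repeated-key fashion.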

In [1]:
import csv
import requests
import pandas as pd
import io
import sys
import warnings

if not sys.warnoptions:
    warnings.simplefilter("ignore")
    
response=requests.get('http://api.archives-ouvertes.fr/search/?q=*:*&wt=csv&fq=submittedDateY_i:[2009 TO 2018]&fq=city_s:Grenoble&fl=docid,city_s,submittedDateY_i,structName_s,structType_s,docType_s,en_abstract_s,domain_s&sort=docid asc&cursorMark=*&rows=10000', auth=('user', 'pass'))
# Parse the CSV payload returned by the API into a DataFrame
# (the query already returns the rows sorted by ascending docid)
df=pd.read_csv(io.StringIO(response.text),sep=',', header=0)
df


Unnamed: 0,docid,city_s,submittedDateY_i,structName_s,structType_s,docType_s,en_abstract_s,domain_s
0,68659,Grenoble,2009,"Laboratoire d'électrotechnique de Grenoble,Ins...","laboratory,institution,institution,laboratory,...",COMM,,"0.spi,1.spi.other"
1,170222,Grenoble,2012,"Laboratoire de mécanique des solides,École pol...","laboratory,institution,institution,institution...",COMM,,"0.phys,1.phys.meca,2.phys.meca.vibr,0.spi,1.sp..."
2,171064,Grenoble,2009,"Institut de Physique Nucléaire de Lyon,Univers...","laboratory,institution,regroupinstitution,inst...",POSTER,The Supernovae Integral Field Spectrograph: ke...,"0.phys,1.phys.astr,2.phys.astr.co,0.sdu,1.sdu...."
3,171587,Grenoble,2009,"Acquisition\, representation and transformatio...","researchteam,laboratory,institution,laboratory...",COMM,"In this paper\, we present a new algorithm for...","0.info,1.info.info-gr"
4,187631,Grenoble,2012,"Interaction Collaborative\, Teleformation\, Te...","laboratory,institution,regroupinstitution,inst...",COMM,E-Laboratories are important components of mod...,"0.info,1.info.info-oh"
5,192925,Grenoble,2013,"Laboratoire de mécanique des sols\, structures...","laboratory,institution,institution,institution...",COMM,,"0.phys,1.phys.cond,2.phys.cond.cm-ms"
6,192929,Grenoble,2013,"Centre des Matériaux,Centre National de la Rec...","laboratory,institution,regroupinstitution,inst...",COMM,,"0.phys,1.phys.cond,2.phys.cond.cm-ms"
7,192936,Grenoble,2013,"Centre des Matériaux,Centre National de la Rec...","laboratory,institution,regroupinstitution,inst...",COMM,,"0.phys,1.phys.cond,2.phys.cond.cm-ms"
8,192942,Grenoble,2013,"ONERA - The French Aerospace Lab [Châtillon],O...","laboratory,institution,laboratory,institution,...",COMM,,"0.phys,1.phys.cond,2.phys.cond.cm-ms"
9,192943,Grenoble,2013,"Centre des Matériaux,Centre National de la Rec...","laboratory,institution,regroupinstitution,inst...",COMM,,"0.phys,1.phys.cond,2.phys.cond.cm-ms"


As shown below, there are 5886 publications in Grenoble between 2009 and 2018. The variable that contains the abstract of each document ('en_abstract_s') has the lowest number of valid values in the dataset (1,698).

In [2]:
df.shape

(5886, 8)

In [3]:
df.count()

docid               5886
city_s              5886
submittedDateY_i    5886
structName_s        5881
structType_s        5881
docType_s           5886
en_abstract_s       1698
domain_s            5577
dtype: int64

Given that one of the objectives of this study is to identify the main research actors, we proceeded to identify the document categories. Eleven categories were found in the dataset of publications for Grenoble. Most categories were considered relevant for the analysis because they correspond to articles, working papers, conferences and reports, which provide information about local actors. Two categories, 'IMG' and 'MEM', which refer to images and student dissertations, were not considered relevant for identifying the main institutions involved in the production of publications in Grenoble.

#### Type of documents:

ART: Journal article

COMM: Conference communication

COUV: Book chapter

DOUV: Edited book, proceedings, special issue

IMG: Image

LECTURE: Lecture

MEM: Student dissertation

OTHER: Other report, seminar, workshop

POSTER: Poster

PRESCONF: Document associated with scientific events

UNDEFINED: Preprint, working paper

In [4]:
df.groupby('docType_s')['docid'].count()

docType_s
ART             2
COMM         5576
COUV            4
DOUV           81
IMG            29
LECTURE         2
MEM            20
OTHER           1
POSTER        162
PRESCONF        7
UNDEFINED       2
Name: docid, dtype: int64

In [5]:
df.set_index("docType_s",inplace=True)
not_publication=['IMG','MEM']
df.drop(not_publication,axis=0,inplace = True) 
df.reset_index(inplace=True)

We verified that the two categories were correctly removed.

In [6]:
df["docType_s"].unique()

array(['COMM', 'POSTER', 'COUV', 'DOUV', 'ART', 'OTHER', 'UNDEFINED',
       'PRESCONF', 'LECTURE'], dtype=object)
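
As a quick consistency check on the filtering step above, the expected size of the working dataset can be recomputed from the per-category counts. This is a minimal sketch that simply re-enters the counts printed by `groupby('docType_s')`; the variable names are illustrative.

```python
import pandas as pd

# Counts per document type, copied from the groupby output above.
doc_type_counts = pd.Series({
    "ART": 2, "COMM": 5576, "COUV": 4, "DOUV": 81, "IMG": 29,
    "LECTURE": 2, "MEM": 20, "OTHER": 1, "POSTER": 162,
    "PRESCONF": 7, "UNDEFINED": 2,
})

not_publication = ["IMG", "MEM"]

total_before = doc_type_counts.sum()
total_after = doc_type_counts.drop(not_publication).sum()

print(total_before, total_after)
```

On a DataFrame, the same filter can be written without touching the index as `df[~df["docType_s"].isin(not_publication)]`.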

Finally, we have an initial working database with 5837 publications

In [7]:
# Keep an independent copy of the cleaned dataset for the abstract analysis (task 4)
original=df.copy()
df.shape

(5837, 8)

## task 1: time trends
In this part, you will calculate the temporal evolution of the volume of publications produced by
#### 1a. all actors;

We obtained the volume of publications per year for the ten-year period under analysis (2009-2018).


In [8]:
df.rename(columns={'submittedDateY_i': 'anio'},inplace=True)
df_anio=df.groupby(['anio'])['docid'].agg('count').reset_index()
df_anio.sort_values(['anio'],ascending=True)

Unnamed: 0,anio,docid
0,2009,482
1,2010,446
2,2011,417
3,2012,490
4,2013,506
5,2014,684
6,2015,436
7,2016,754
8,2017,649
9,2018,973


Overall, the number of publications in Grenoble increased over the years, but with marked fluctuations. After a decline from 2009 to 2011, the number of publications rose between 2011 and 2014, then dropped in 2015 to 436, from 684 publications in 2014. From 2016 onwards the level was higher, reaching 973 publications in 2018 after a drop in 2017 (105 publications fewer than in 2016).
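
The year-to-year fluctuations can be made explicit by computing the absolute and relative change between consecutive years. The sketch below re-enters the yearly counts from the table above; the column names `change` and `pct_change` are illustrative.

```python
import pandas as pd

# Publications per year, copied from the table above.
per_year = pd.Series(
    [482, 446, 417, 490, 506, 684, 436, 754, 649, 973],
    index=range(2009, 2019),
    name="publications",
)

summary = pd.DataFrame({
    "publications": per_year,
    # Difference with respect to the previous year (NaN for 2009).
    "change": per_year.diff(),
    # Relative change in percent, rounded to one decimal.
    "pct_change": (per_year.pct_change() * 100).round(1),
})

print(summary)
```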

In [9]:
import seaborn as sns
sns.lineplot(data = df_anio, x="anio", y="docid")

<matplotlib.axes._subplots.AxesSubplot at 0x2587bdbae48>

#### 1b. the top 5 active actors.
This task requires the number of publications per institution, which implies reshaping the dataset from wide to long format. The first step was to split the field that contains the names of the institutions, 'structName_s'. We also split the field that identifies the type of structure, which distinguishes institutions, laboratories, regrouped institutions, regrouped laboratories, research teams and departments. This field is important because it provides more detailed information about local actors for the Grenoble4Research initiative.

In [10]:
# One column per institution name (i1..i30) and per structure type (t1..t30)
inst_cols=[f'i{k}' for k in range(1,31)]
type_cols=[f't{k}' for k in range(1,31)]
df[inst_cols] = df['structName_s'].str.split(',',n=29, expand=True)
df[type_cols] = df['structType_s'].str.split(',',n=29, expand=True)

In [11]:
long=pd.wide_to_long(df, stubnames=['i', 't'], i=['docid','anio','en_abstract_s','domain_s'], j='dropme')\
  .reset_index()\
  .drop(['dropme','structType_s','structName_s','city_s'], axis=1)\
  .sort_values('docid')\
  .dropna(subset=["t"])
long.head()
long=long.drop_duplicates()
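
Some institution names in the HAL export contain escaped commas (`\,`), as can be seen in the raw table in task 0. As an alternative to the wide-to-long reshaping above, the sketch below splits only on unescaped commas and reshapes with `explode`. It is an illustration on toy data, not the method used in this notebook: the helper `split_escaped` and the toy values are hypothetical, multi-column `explode` assumes pandas 1.3 or later, and it assumes that names and types have the same number of entries in each row.

```python
import re
import pandas as pd

# A comma that is NOT preceded by a backslash.
UNESCAPED_COMMA = re.compile(r"(?<!\\),")


def split_escaped(value):
    """Split on unescaped commas and restore the escaped ones (hypothetical helper)."""
    if pd.isna(value):
        return []
    return [part.replace("\\,", ",") for part in UNESCAPED_COMMA.split(value)]


# Toy rows imitating the structure of the HAL export.
toy = pd.DataFrame({
    "docid": [1, 2],
    "structName_s": ["Lab A\\, B,Univ X", "Univ Y"],
    "structType_s": ["laboratory,institution", "institution"],
})

toy["i"] = toy["structName_s"].apply(split_escaped)
toy["t"] = toy["structType_s"].apply(split_escaped)

# One row per (publication, institution) pair.
long_toy = toy[["docid", "i", "t"]].explode(["i", "t"], ignore_index=True)
print(long_toy)
```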

Once the dataset had one row per institution, we applied a groupby to obtain the total number of publications of each institution. The 'Centre National de la Recherche Scientifique' (CNRS) is the institution with the highest number of publications (3837). The second institution is the 'Université Grenoble Alpes' with 1476 publications, less than half the number of the CNRS. This result can be explained by the fact that the CNRS is the largest governmental research organisation in France and the largest fundamental science agency in Europe.


The remaining three institutions in the top five are 'Université Joseph Fourier - Grenoble 1' (1367 publications), 'Université Pierre Mendès France - Grenoble 2' (1024) and 'Institut polytechnique de Grenoble' (583).

In [12]:
df_insti=long.groupby(['i'])['docid'].agg('count').reset_index().sort_values(['docid'],ascending=False)
df_insti.head(5)

Unnamed: 0,i,docid
1024,Centre National de la Recherche Scientifique,3837
4317,Université Grenoble Alpes,1476
4332,Université Joseph Fourier - Grenoble 1,1367
4374,Université Pierre Mendès France - Grenoble 2,1024
2598,Institut polytechnique de Grenoble - Grenoble ...,583


The 'Centre National de la Recherche Scientifique' appears under several structure types, including department, laboratory and research team, but most of its publications are recorded under the type 'institution'.

In [13]:
top=long[(long['i']=='Centre National de la Recherche Scientifique')]
top.pivot_table(index=["t"],values=["docid"], columns=["anio"],aggfunc="count", fill_value=" ")

Unnamed: 0_level_0,docid,docid,docid,docid,docid,docid,docid,docid,docid,docid
anio,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018
t,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2
department,1.0,2.0,1.0,,4.0,,1.0,4.0,3.0,4.0
institution,347.0,328.0,257.0,316.0,311.0,424.0,273.0,447.0,347.0,388.0
laboratory,29.0,25.0,12.0,28.0,23.0,43.0,14.0,22.0,27.0,36.0
regroupinstitution,3.0,2.0,1.0,,1.0,3.0,5.0,7.0,5.0,7.0
regrouplaboratory,,,1.0,,4.0,,1.0,3.0,,2.0
researchteam,8.0,10.0,2.0,11.0,10.0,8.0,8.0,6.0,4.0,8.0


## task 2: statistics per scientific domain
HAL features a native taxonomy of scientific domains . For local decision makers, the specialisation of
actors in certain scientific fields is a useful piece of information. To obtain this, you should now:
#### 2a. get statistics of number of publications per field;



This task analyses the number of publications per field. As in the first task, we used the long dataset and split the column that identifies the fields of each publication. Then we assigned a label to each domain.

In [14]:
import numpy as np
# Split the comma-separated domain codes into one column per code (d1..d23)
long2=long.copy()
long2['domain_s'] = long2['domain_s'].astype(str)
domain_cols=[f'd{k}' for k in range(1,24)]
long2[domain_cols]=long2['domain_s'].str.split(',',n=22, expand=True)

In [15]:
long3=pd.wide_to_long(long2, stubnames=['d'], i=['docid','anio','docType_s','en_abstract_s','i','t'], j='dropme')\
  .reset_index()\
  .drop(['dropme','domain_s','en_abstract_s','docType_s'], axis=1)\
  .sort_values('docid')\
  .rename(columns={'d':'domain'})\
  .drop_duplicates()
long3.head()


Unnamed: 0,docid,anio,i,t,domain
0,68659,2009,Laboratoire d'électrotechnique de Grenoble,laboratory,0.spi
84,68659,2009,Laboratoire des images et des signaux,laboratory,
70,68659,2009,Laboratoire des images et des signaux,laboratory,1.spi.other
69,68659,2009,Laboratoire des images et des signaux,laboratory,0.spi
68,68659,2009,Centre National de la Recherche Scientifique,institution,


In [16]:
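import numpy as np
# Keep only top-level domain codes (prefix '0.') and strip that prefix.
# regex=False matches the literal '0.' instead of '0' followed by any character.
long3.domain = np.where(long3.domain.str.contains('0.',regex=False),long3.domain,'')
long3.domain = long3.domain.str.replace('0.','',regex=False)
long3=long3[(long3['domain'] != '')]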


In [17]:
d2m={"domain":{'chim': 'Chimie','info' : 'Informatique', 'math' : 'Mathématiques', 'nlin' : 'Science non linéaire',\
               'phys' : 'Physique', 'qfin' : 'Économie et finance quantitative', 'scco' : 'Sciences cognitives',\
               'sde' : 'Sciences de lenvironnement', 'sdu' : 'Planète et Univers', 'sdv' : 'Sciences du Vivant', \
               'shs' : 'Sciences de lHomme et Société', 'spi' : 'Sciences de lingénieur', 'stat' : 'Statistiques'}}
long3.replace(d2m,inplace=True)

The following table shows the number of publications by domain between 2009 and 2018. 'Human Sciences and Society' is the main domain of publications in Grenoble during this decade, with 2220 publications, while 'Nonlinear Sciences' is at the bottom of the overall production with only 3 publications. The domains ranked second and third are 'Engineering Sciences' and 'Informatics', with quite similar numbers of publications (1,240 and 1,178, respectively). 'Physics' (637) and 'Environmental Sciences' (303) follow, while the remaining domains lag far behind, with fewer than 200 publications each over the ten-year period under analysis.

We should take into account that the sum of number of publications of all domains is not the same as in the original dataset (which had 5837 publications) and this is explained by the fact that one publication can be linked to more than one domain.
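
The effect described above can be reproduced on a toy example: when a publication carries several domains, the per-domain counts add up to more than the number of distinct publications. The data and variable names below are purely illustrative.

```python
import pandas as pd

# Toy (publication, domain) pairs; publications 1 and 3 have several domains.
toy_pairs = pd.DataFrame({
    "docid": [1, 1, 2, 3, 3, 3],
    "domain": ["info", "spi", "shs", "info", "phys", "spi"],
}).drop_duplicates()

# Publications counted once per domain they belong to.
per_domain = toy_pairs.groupby("domain")["docid"].count()

# Number of distinct publications.
n_publications = toy_pairs["docid"].nunique()

print(per_domain)
print("sum over domains:", per_domain.sum(), "| distinct publications:", n_publications)
```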

In [18]:
# One row per (publication, domain) pair, so each publication is counted once per domain
long4=long3[['docid','domain']].drop_duplicates()
df_domain=long4.groupby(['domain'])['docid'].agg('count').reset_index().sort_values(['docid'],ascending=False)


In [19]:
df_domain

Unnamed: 0,domain,docid
7,Sciences de lHomme et Société,2220
9,Sciences de lingénieur,1240
1,Informatique,1178
3,Physique,637
8,Sciences de lenvironnement,303
10,Sciences du Vivant,172
0,Chimie,144
4,Planète et Univers,135
2,Mathématiques,130
6,Sciences cognitives,104


#### 2b. calculate the Grenoble specialization index in each field with respect to France (i.e. the whole HAL database, i.e. without filtering)

Here we repeat, for the publications of France as a whole, the same processing and cleaning steps applied to Grenoble's publications.

In [20]:
response=requests.get('http://api.archives-ouvertes.fr/search/?q=*:*&wt=csv&fq=submittedDateY_i:[2009 TO 2013]&fq=country_s:fr&fl=docid,city_s,submittedDateY_i,structName_s,structType_s,docType_s,domain_s&sort=docid asc&start=0&rows=10000', auth=('user', 'pass'))
text=response.iter_lines(decode_unicode='utf-8') 
reader=csv.reader(text,delimiter=',')
france1=pd.read_csv(io.StringIO(response.text),sep=',', header=0)

In [21]:
response=requests.get('http://api.archives-ouvertes.fr/search/?q=*:*&wt=csv&fq=submittedDateY_i:[2014 TO 2018]&fq=country_s:fr&fl=docid,city_s,submittedDateY_i,structName_s,structType_s,docType_s,domain_s&sort=docid asc&start=0&rows=10000', auth=('user', 'pass'))
text=response.iter_lines(decode_unicode='utf-8') 
reader=csv.reader(text,delimiter=',')
france2=pd.read_csv(io.StringIO(response.text),sep=',', header=0)

In [22]:
france=pd.concat([france1, france2], ignore_index=True)


In [23]:
#institutions and type
france[['i1','i2','i3','i4','i5','i6','i7','i8','i9','i10','i11','i12','i13','i14','i15','i16','i17','i18','i19','i20','i21','i22','i23','i24','i25','i26','i27','i28','i29','i30']] = france['structName_s'].str.split(',',29, expand=True)
france[['t1','t2','t3','t4','t5','t6','t7','t8','t9','t10','t11','t12','t13','t14','t15','t16','t17','t18','t19','t20','t21','t22','t23','t24','t25','t26','t27','t28','t29','t30']] = france['structType_s'].str.split(',',29, expand=True)

In [24]:
france_long=pd.wide_to_long(france, stubnames=['i', 't'], i=['docid','submittedDateY_i','docType_s','domain_s'], j='dropme')\
  .reset_index()\
  .drop(['dropme','structType_s','structName_s','city_s'], axis=1)\
  .sort_values('docid')\
  .dropna(subset=["t"])\
  .rename(columns={'submittedDateY_i': 'anio'})
france_long.head()

Unnamed: 0,docid,anio,docType_s,domain_s,i,t
0,9081,2013,COMM,"0.phys,1.phys.phys,2.phys.phys.phys-ins-det",Institut de Physique Nucléaire d'Orsay,laboratory
13,9081,2013,COMM,"0.phys,1.phys.phys,2.phys.phys.phys-ins-det",Institut National de Physique Nucléaire et de ...,institution
14,9081,2013,COMM,"0.phys,1.phys.phys,2.phys.phys.phys-ins-det",Centre National de la Recherche Scientifique,institution
1,9081,2013,COMM,"0.phys,1.phys.phys,2.phys.phys.phys-ins-det",Université Paris-Sud - Paris 11,institution
2,9081,2013,COMM,"0.phys,1.phys.phys,2.phys.phys.phys-ins-det",Institut National de Physique Nucléaire et de ...,institution


In [25]:
france_long=france_long.drop_duplicates()
france_anio=france_long.groupby(['anio'])['docid'].agg('count').reset_index().sort_values(['docid'],ascending=False)


In [26]:
#splitting columns fields
france_long['domain_s'] = france_long['domain_s'].astype(str)
france_long[['d1','d2','d3','d4','d5','d6','d7','d8','d9','d10','d11','d12','d13','d14','d15','d16','d17','d18','d19','d20','d21','d22','d23','d24','d25','d26','d27','d28','d29']]=france_long['domain_s'].str.split(',',28, expand=True)


In [27]:
#wide to long
france_long3=pd.wide_to_long(france_long, stubnames=['d'], i=['docid','anio','docType_s','i','t'], j='dropme')\
  .reset_index()\
  .drop(['dropme','domain_s'], axis=1)\
  .sort_values('docid')\
  .drop_duplicates()\
  .rename(columns={'d':'domain'})
france_long3.head()


Unnamed: 0,docid,anio,docType_s,i,t,domain
0,9081,2013,COMM,Institut de Physique Nucléaire d'Orsay,laboratory,0.phys
197,9081,2013,COMM,Université de Caen Normandie,institution,
215,9081,2013,COMM,Grand Accélérateur National d'Ions Lourds,laboratory,
205,9081,2013,COMM,Grand Accélérateur National d'Ions Lourds,laboratory,2.phys.phys.phys-ins-det
204,9081,2013,COMM,Grand Accélérateur National d'Ions Lourds,laboratory,1.phys.phys


In [28]:
import numpy as np
#cleaning the columns of domains
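# regex=False matches the literal prefix '0.' instead of '0' followed by any character
france_long3.domain = np.where(france_long3.domain.str.contains('0.',regex=False),france_long3.domain,'')
france_long3.domain = france_long3.domain.str.replace('0.','',regex=False)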
france_long3=france_long3[(france_long3['domain'] != '')]

In [29]:
d2m={"domain":{'chim': 'Chimie','info' : 'Informatique', 'math' : 'Mathématiques', 'nlin' : 'Science non linéaire',\
               'phys' : 'Physique', 'qfin' : 'Économie et finance quantitative', 'scco' : 'Sciences cognitives',\
               'sde' : 'Sciences de lenvironnement', 'sdu' : 'Planète et Univers', 'sdv' : 'Sciences du Vivant', \
               'shs' : 'Sciences de lHomme et Société', 'spi' : 'Sciences de lingénieur', 'stat' : 'Statistiques'}}
france_long3.replace(d2m,inplace=True)

In [30]:
# One row per (publication, domain) pair
france_long4=france_long3[['docid','domain']].drop_duplicates()
france_domain=france_long4.groupby(['domain'])['docid'].agg('count').reset_index().sort_values(['docid'],ascending=False)
france_domain.rename(columns={'docid':'france_publications'},inplace=True)
france_domain


Unnamed: 0,domain,france_publications
7,Sciences de lHomme et Société,6888
9,Sciences de lingénieur,3514
1,Informatique,2998
3,Physique,1789
8,Sciences de lenvironnement,1506
0,Chimie,843
4,Planète et Univers,840
10,Sciences du Vivant,781
2,Mathématiques,605
6,Sciences cognitives,248


In [31]:
df_index=pd.merge(france_domain,df_domain,how="inner",on='domain',indicator=True)
df_index.rename(columns={'docid': 'grenoble_publications'}, inplace=True)
df_index = df_index[(df_index['_merge']=='both')]
df_index.drop(columns=['_merge'],inplace=True)
df_index

Unnamed: 0,domain,france_publications,grenoble_publications
0,Sciences de lHomme et Société,6888,2220
1,Sciences de lingénieur,3514,1240
2,Informatique,2998,1178
3,Physique,1789,637
4,Sciences de lenvironnement,1506,303
5,Chimie,843,144
6,Planète et Univers,840,135
7,Sciences du Vivant,781,172
8,Mathématiques,605,130
9,Sciences cognitives,248,104


The next table computes the specialization index: the share of each domain in Grenoble's publications, divided by the share of the same domain in France's publications. An index below 1 indicates a comparative disadvantage, and an index above 1 indicates specialization in the domain.
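
The definition above can be written as a small function. This is a minimal sketch on toy numbers; the function name `specialization_index` and the two toy domains are illustrative and not part of the notebook.

```python
import pandas as pd


def specialization_index(local_counts, national_counts):
    """Local share of each domain divided by its national share."""
    local_share = local_counts / local_counts.sum()
    national_share = national_counts / national_counts.sum()
    return (local_share / national_share).sort_values(ascending=False)


# Toy counts: domain A is 30% of local output but only 10% of national output.
toy_local = pd.Series({"A": 30, "B": 70})
toy_national = pd.Series({"A": 100, "B": 900})

toy_index = specialization_index(toy_local, toy_national)
print(toy_index.round(3))
```

In this toy case domain A has an index of about 3 (specialization) and domain B an index below 1 (comparative disadvantage).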

In [32]:
df_index['x']=(df_index.grenoble_publications/df_index['grenoble_publications'].sum())
df_index['m']=(df_index.france_publications/df_index['france_publications'].sum())
df_index['percent']=df_index['grenoble_publications']/df_index['france_publications']
df_index['specializat_index']=df_index['x']/df_index['m']
df_index.sort_values(['specializat_index'],ascending=False)

Unnamed: 0,domain,france_publications,grenoble_publications,x,m,percent,specializat_index
12,Science non linéaire,4,3,0.000475,0.000198,0.75,2.397271
9,Sciences cognitives,248,104,0.016456,0.012277,0.419355,1.340409
2,Informatique,2998,1178,0.186392,0.148408,0.392929,1.255942
3,Physique,1789,637,0.100791,0.08856,0.356065,1.138112
1,Sciences de lingénieur,3514,1240,0.196203,0.173952,0.352874,1.127913
10,Statistiques,151,50,0.007911,0.007475,0.331126,1.058398
0,Sciences de lHomme et Société,6888,2220,0.351266,0.340973,0.3223,1.030186
7,Sciences du Vivant,781,172,0.027215,0.038661,0.22023,0.703936
8,Mathématiques,605,130,0.02057,0.029949,0.214876,0.686821
4,Sciences de lenvironnement,1506,303,0.047943,0.074551,0.201195,0.643093


The specialization index reveals that seven of the twelve domains analyzed in Grenoble show a comparative advantage with respect to the whole production in France. In detail, the top three domains in which Grenoble plays an important role relative to France are 'Nonlinear Sciences' (2.40), 'Cognitive Sciences' (1.34) and 'Informatics' (1.26), while the least specialized are 'Economics and Quantitative Finance' (0.37) and 'Universe and Planet' (0.51).


#### 2c. get the top 5 actors in the most relevant field (in terms of specialization) in the city.

Regarding the top 5 institutions in the most relevant field according to the specialization index above, 'Nonlinear Sciences' is the leading domain and the Centre National de la Recherche Scientifique is the most relevant institution in it, with three publications. It is followed by 36 institutions with one publication each, which very likely worked in collaboration.

In [34]:
df_insti2=long3[(long3.domain=='Science non linéaire')].groupby(['i','t','domain'])['docid'].agg('count').reset_index().sort_values(['docid'],ascending=False)
df_insti2.rename(columns={'i': 'institution','t':'type','docid':'N publications'},inplace=True)
df_insti2.head(50)

Unnamed: 0,institution,type,domain,N publications
5,Centre National de la Recherche Scientifique,institution,Science non linéaire,3
0,Description and Control from Image Sequences,institution,Science non linéaire,1
28,Université Paris Diderot - Paris 7,institution,Science non linéaire,1
21,Institut des Systèmes Complexes - Paris Ile-de...,laboratory,Science non linéaire,1
22,Institut national de recherche en sciences et ...,laboratory,Science non linéaire,1
23,Laboratoire Jean Kuntzmann,laboratory,Science non linéaire,1
24,Sorbonne Université,institution,Science non linéaire,1
25,Université Grenoble Alpes,institution,Science non linéaire,1
26,Université Joseph Fourier - Grenoble 1,institution,Science non linéaire,1
27,Université Panthéon-Sorbonne,institution,Science non linéaire,1


## task 3: collaboration network
The collaboration network will help local decision makers understand how local actors have been
establishing relationships. A collaboration occurs when two (or more) institutions appear together in the
same publication. You should now identify the five most relevant/connected/central actors in the
network resulting from the coauthorship of publications.

A new variable that identifies whether the publication was done in collaboration (it will be called 'collabora') is added

In [35]:
import numpy as np
df_collabora=long.groupby(['docid'])['i'].agg('count').reset_index().sort_values(['docid'],ascending=False)
df_collabora['collabora']=pd.cut(df_collabora.i, [0, 1, np.inf], labels=[0,1])
df_collabora['collabora'] = df_collabora['collabora'].apply(pd.to_numeric, errors='coerce')
df_collabora.head()

Unnamed: 0,docid,i,collabora
5831,1966953,3,1
5830,1966813,18,1
5829,1966792,9,1
5828,1966731,11,1
5827,1966726,13,1


The results indicate that most of documents were developed in collaboration (97%): 5,665 publications were done in networks, while only 167 were done by just one institution

In [36]:
collabora=df_collabora.groupby(['collabora'])['docid'].agg('count').reset_index()
collabora['percent']=round(collabora['docid']*100/collabora['docid'].sum())
collabora

Unnamed: 0,collabora,docid,percent
0,0,167,3.0
1,1,5665,97.0


Then we merged the new variable 'collabora' with the dataframe that identifies the name of each institution.

In [37]:
longc=long3.drop(columns=['domain'])
df_collabora2=pd.merge(longc,df_collabora,on='docid',indicator=True)
df_collabora2 = df_collabora2[(df_collabora2['_merge']=='both')]
df_collabora2.drop(columns=['_merge'],inplace=True)
df_collabora2=df_collabora2.drop_duplicates()
df_collabora3=df_collabora2.dropna(subset=['i_x','t'])

Once we grouped by institution and the times that they participated in a document made in collaboration, we could obtain the five most relevant actors in the coauthorship of publications.

The table shows that the Centre National de la Recherche Scientifique is the leading actor in research networks in Grenoble, since it produced 3264 publications in co-authorship with other institutions.


In [38]:
df_collabora3['collabora'] = df_collabora3['collabora'].apply(pd.to_numeric, errors='coerce')
df_collabora_inst=df_collabora3.groupby(['i_x','t'])['collabora'].agg('count').reset_index().sort_values(['collabora'],ascending=False)
# Give the ranking table readable column names
df_collabora_inst.rename(columns={'i_x': 'institution','t':'type','collabora':'publications in coauthorship'},inplace=True)
df_collabora_inst.head(10)

Unnamed: 0,institution,type,publications in coauthorship
1190,Centre National de la Recherche Scientifique,institution,3264
5149,Université Grenoble Alpes,institution,1299
5177,Université Joseph Fourier - Grenoble 1,institution,1277
5251,Université Pierre Mendès France - Grenoble 2,institution,990
3090,Institut polytechnique de Grenoble - Grenoble ...,institution,565
2787,Institut National Polytechnique de Grenoble,institution,512
2849,Institut Polytechnique de Grenoble - Grenoble ...,institution,496
2796,Institut National de Recherche en Informatique...,institution,419
5362,Université de Lyon,regroupinstitution,372
2821,Institut National des Sciences Appliquées,regroupinstitution,305
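
As a complement to the ranking above, which counts publications per institution, connectivity can also be measured on the co-occurrence network itself: two institutions are linked whenever they appear on the same publication. The sketch below builds the weighted edges and two simple degree measures using only the standard library and pandas. The toy data and all variable names are illustrative; this is one possible reading of "most connected", not the method used in this notebook.

```python
from collections import Counter
from itertools import combinations

import pandas as pd

# Toy long-format data: one row per (publication, institution).
toy_long = pd.DataFrame({
    "docid": [1, 1, 1, 2, 2, 3],
    "i": ["CNRS", "UGA", "INRIA", "CNRS", "UGA", "UGA"],
})

# Edge weight = number of publications shared by a pair of institutions.
edge_weights = Counter()
for _, names in toy_long.groupby("docid")["i"]:
    for a, b in combinations(sorted(set(names)), 2):
        edge_weights[(a, b)] += 1

# Degree = number of distinct partners; weighted degree = number of shared publications.
degree = Counter()
weighted_degree = Counter()
for (a, b), weight in edge_weights.items():
    degree[a] += 1
    degree[b] += 1
    weighted_degree[a] += weight
    weighted_degree[b] += weight

print(weighted_degree.most_common(5))
```

Publication 3 has a single institution, so it contributes no edge, which mirrors the publications classified as non-collaborative above.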


In [39]:
#p = sns.catplot(x="i_y", y='collabora',data=df_collabora2, kind="bar")

## (optional) task 4: keywords

Knowing production per scientific disciplines is often not enough for research policy-makers. It is often
necessary to get to a deeper level of understanding of the content of scientific publications. Thus, for
this, you can try and extract keywords from the abstract of publications. To carry out this task, you can
use the techniques, tools or libraries that you want. You will

#### 4a. extract keywords from all the available english abstracts and

In order to analyze the texts, we explored the data and did a brief cleaning. First, we kept only the columns of interest (publication id and abstract) and removed the rows where the abstract was missing. We obtained 1,688 English abstracts of publications.

In [108]:
# Keep id and abstract, drop publications without an English abstract
texts=original[['docid','en_abstract_s']].dropna(subset=["en_abstract_s"])
texts=texts.rename(columns={'en_abstract_s': 'abstract'})

In [109]:
texts.count()

docid       1688
abstract    1688
dtype: int64

To prepare the abstracts of the Grenoble publications for the LDA model, we applied cleaning and tokenization procedures.

In [110]:
import re, nltk
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize


def clean_abstract(abstract):
    return " ".join(re.sub("b'(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])", " ", abstract.lower()).split())
texts['abstract'] = texts['abstract'].astype(str)
texts['abstract'] = texts['abstract'].apply(clean_abstract)

In [111]:
from nltk.tokenize import word_tokenize
# Tokenize each abstract into a list of words
texts['abstract'] = texts.apply(lambda row: word_tokenize(row['abstract']), axis=1)

# Remove English stop words from each token list
stopwords_set = set(stopwords.words("english"))
texts['abstract'] = [[word for word in line if word not in stopwords_set] for line in texts.abstract]

In [112]:
texts_per_instit=pd.merge(long3,texts,how="left",on='docid',indicator=True)
texts_per_instit = texts_per_instit[(texts_per_instit['_merge']=='both')]
texts_per_instit.drop(columns=['_merge'],inplace=True)
texts_per_instit.head()

Unnamed: 0,docid,anio,i,t,domain,abstract
29,171064,2009,Université Paris Diderot - Paris 7,institution,Physique,"[supernovae, integral, field, spectrograph, ke..."
30,171064,2009,Université Paris Diderot - Paris 7,institution,Planète et Univers,"[supernovae, integral, field, spectrograph, ke..."
31,171064,2009,Institut national des sciences de l'Univers,institution,Physique,"[supernovae, integral, field, spectrograph, ke..."
32,171064,2009,Institut national des sciences de l'Univers,institution,Planète et Univers,"[supernovae, integral, field, spectrograph, ke..."
33,171064,2009,Institut de Physique Nucléaire de Lyon,laboratory,Physique,"[supernovae, integral, field, spectrograph, ke..."


#### 4b. identify the most recurrent keywords attached to the publications of local actors.

Using scikit-learn's CountVectorizer, we extracted keywords from the abstracts of the Grenoble publications and then tabulated the 20 most frequent words.

In [113]:
texts['abstract'] = texts['abstract'].astype(str)
data = texts['abstract'].values.tolist()
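
The cell above turns each token list back into a string with `astype(str)`, which yields the Python list representation (brackets, quotes and commas included). A minimal sketch, assuming CountVectorizer's default token pattern `(?u)\b\w\w+\b`, suggests that this representation and a plain space-joined string produce the same tokens, so joining with spaces may be a cleaner equivalent. The token list is a shortened version of the first abstract previewed earlier.

```python
import re

# Default token pattern of scikit-learn's CountVectorizer (words of 2+ characters).
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

tokens = ["supernovae", "integral", "field", "spectrograph"]

as_repr = str(tokens)        # what astype(str) produces for a list of tokens
as_text = " ".join(tokens)   # space-joined alternative

tokens_from_repr = re.findall(TOKEN_PATTERN, as_repr)
tokens_from_text = re.findall(TOKEN_PATTERN, as_text)

print(as_repr)
print(as_text)
print(tokens_from_repr == tokens_from_text)
```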

In [62]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(lowercase=True,min_df=10)
# Transform the documents into a document-term frequency (tf) matrix.
vectorizer.fit(data)
print ("Atributos:",vectorizer.get_feature_names()) # "Atributos" = features: the vectorizer learns the vocabulary of the corpus
# Extract the word frequencies (tf)
data_vectorized = vectorizer.transform(data) 


Atributos: ['10', '100', '11', '12', '13', '14', '15', '16', '17', '20', '200', '2000', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '21', '25', '2d', '30', '3d', '50', '60', 'ability', 'able', 'absence', 'absorption', 'abstract', 'abstraction', 'academic', 'access', 'according', 'account', 'accounting', 'accuracy', 'accurate', 'accurately', 'achieve', 'achieved', 'achieves', 'acoustic', 'acquired', 'acquisition', 'across', 'act', 'action', 'actions', 'active', 'activit', 'activities', 'activity', 'actors', 'actual', 'actually', 'ad', 'adapt', 'adaptation', 'adapted', 'adapting', 'adaptive', 'add', 'added', 'adding', 'addition', 'additional', 'address', 'addressed', 'addresses', 'addressing', 'adequate', 'administrative', 'adopt', 'adopted', 'adoption', 'adsorption', 'advanced', 'advantage', 'advantages', 'aerosol', 'affect', 'affected', 'affects', 'age', 'ageing', 'agency', 'agent', 'agents', 'agreement', 'agriculture'
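
The `min_df=10` setting keeps only terms that occur in at least 10 different abstracts. The following pure-Python sketch illustrates that document-frequency filter on a toy corpus with a threshold of 2. The corpus and the helper name `vocabulary_with_min_df` are hypothetical and only meant to show the idea, not to reproduce scikit-learn's implementation.

```python
from collections import Counter

def vocabulary_with_min_df(tokenized_docs, min_df):
    """Return the sorted terms that appear in at least `min_df` documents."""
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))   # count each term once per document
    return sorted(term for term, df in doc_freq.items() if df >= min_df)

# Toy corpus, for illustration only.
toy_docs = [
    ["data", "model", "data", "results"],
    ["model", "graph", "proof"],
    ["data", "urban", "policy"],
]
print(vocabulary_with_min_df(toy_docs, min_df=2))  # ['data', 'model']
```

Note that "data" occurs twice in the first toy document but still counts as one document, which is the difference between document frequency and raw term frequency.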

The table below lists the most frequent words in the abstracts. Most of them are generic research-methodology terms, such as 'data based paper', 'model results', 'analysis approach' or 'different method'.

In [63]:
from sklearn.feature_extraction.text import CountVectorizer

def get_top_n_words(corpus, n=None):
    vec = CountVectorizer(stop_words = 'english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
    return words_freq[:n]

common_words = get_top_n_words(data, 20)
    
df2 = pd.DataFrame(common_words, columns = ['abstract' , 'count'])
df2

Unnamed: 0,abstract,count
0,based,687
1,paper,675
2,data,561
3,results,546
4,model,534
5,new,487
6,time,483
7,using,470
8,analysis,456
9,approach,447


## (optional) task 5: clustering texts
Beyond the extraction of keywords, it would be useful to cluster the texts of the abstracts in
accordance with their content. This exercise enables one in practice to extract topics emerging from
the corpus of identified document. For this task, you might identify groups of publications which have
similar abstracts. You can use the clustering technique (either soft - LDA, Gaussian Mixtures etc - or
hard - K-means, hierarchical clustering-) that you deem appropriate. In particular, you could

#### 5a. show which are the textual terms that characterise each cluster, and check how many publications there are per cluster;

In this section, we applied LDA because it groups the abstracts by dominant topic. I consider it better suited than K-means here: K-means splits N documents into K disjoint clusters (topics, in this case), whereas LDA assigns each document a mixture of topics. Each document is thus characterized by one or more topics.

In [64]:
import re, nltk
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Sklearn
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from pprint import pprint

# Plotting tools
import pyLDAvis
import pyLDAvis.sklearn
import matplotlib.pyplot as plt
%matplotlib inline



Once the array is ready, we build a Latent Dirichlet Allocation (LDA) model and apply fit_transform(). As a starting point, I set n_topics to 20; the optimal number of topics is then searched for with a grid search.

In [65]:
# Build LDA Model
lda_model = LatentDirichletAllocation(n_topics=20,               # Number of topics
                                      max_iter=10,               # Max learning iterations
                                      learning_method='online',   
                                      random_state=100,          # Random state
                                      batch_size=128,            # n docs in each learning iter
                                      evaluate_every = -1,       # compute perplexity every n iters, default: Don't
                                      n_jobs = -1,               # Use all available CPUs
                                     )
lda_output = lda_model.fit_transform(data_vectorized)

print(lda_model)  # Model attributes



LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7,
             learning_method='online', learning_offset=10.0,
             max_doc_update_iter=100, max_iter=10, mean_change_tol=0.001,
             n_components=10, n_jobs=-1, n_topics=20, perp_tol=0.1,
             random_state=100, topic_word_prior=None,
             total_samples=1000000.0, verbose=0)
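
The printed model shows both `n_components=10` and `n_topics=20`. In the scikit-learn release used here, `n_topics` appears to be a deprecated alias; as far as I know it has been removed in later releases, where `n_components` is the only way to set the number of topics. A minimal sketch on a made-up toy corpus, assuming a recent scikit-learn version:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, for illustration only.
toy_docs = [
    "graph vertices edges proof bounds",
    "graph algorithms vertices degree",
    "social urban territorial public policy",
    "urban public actors territory innovation",
    "model data analysis approach results",
    "data model systems time analysis",
]
toy_counts = CountVectorizer().fit_transform(toy_docs)

toy_lda = LatentDirichletAllocation(n_components=3,            # number of topics
                                    max_iter=10,
                                    learning_method="online",
                                    random_state=100)
toy_doc_topic = toy_lda.fit_transform(toy_counts)

print(toy_doc_topic.shape)          # one row per document, one column per topic
print(toy_doc_topic.sum(axis=1))    # each row is a topic mixture
```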


A good model has a high log-likelihood and a low perplexity. Judged by these two measures, the results of our initial model need to be improved.

In [66]:
# Log likelihood: the higher, the better
print("Log Likelihood: ", lda_model.score(data_vectorized))
# Perplexity: the lower, the better
print("Perplexity: ", lda_model.perplexity(data_vectorized))
# See model parameters
pprint(lda_model.get_params())

Log Likelihood:  -720199.0302300556
Perplexity:  1348.0120058928117
{'batch_size': 128,
 'doc_topic_prior': None,
 'evaluate_every': -1,
 'learning_decay': 0.7,
 'learning_method': 'online',
 'learning_offset': 10.0,
 'max_doc_update_iter': 100,
 'max_iter': 10,
 'mean_change_tol': 0.001,
 'n_components': 10,
 'n_jobs': -1,
 'n_topics': 20,
 'perp_tol': 0.1,
 'random_state': 100,
 'topic_word_prior': None,
 'total_samples': 1000000.0,
 'verbose': 0}
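
The two scores are linked. Assuming scikit-learn's definition of perplexity as the exponential of the negative log-likelihood per word, `perplexity = exp(-log_likelihood / n_words)`, the reported values imply a corpus size. The sketch below only rearranges that formula with the numbers printed above; the resulting token count is an implied estimate, not a figure reported by the notebook.

```python
import math

log_likelihood = -720199.0302300556   # value printed above
perplexity = 1348.0120058928117       # value printed above

# perplexity = exp(-log_likelihood / n_words)  =>  n_words = -log_likelihood / ln(perplexity)
implied_n_words = -log_likelihood / math.log(perplexity)
per_word_log_likelihood = log_likelihood / implied_n_words

print(round(implied_n_words))
print(round(per_word_log_likelihood, 3))
```

Because perplexity is normalized per word, it is the easier of the two numbers to compare across corpora of different sizes.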


Here, I applied GridSearchCV to tune two parameters of the LDA model: n_components (the number of topics) and learning_decay (which controls the learning rate).

In [67]:
# parameters
search_params = {'n_components': [10, 15, 20, 25], 'learning_decay': [.5, .7]}
# the Model
lda = LatentDirichletAllocation(max_iter=5, learning_method='online', learning_offset=50.,random_state=0)
model = GridSearchCV(lda, param_grid=search_params)
# Do the Grid Search
model.fit(data_vectorized)
GridSearchCV(cv=None, error_score='raise',
       estimator=LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7, learning_method=None,
             learning_offset=10.0, max_doc_update_iter=100, max_iter=10,
             mean_change_tol=0.001, n_components=10, n_jobs=1,
             n_topics=None, perp_tol=0.1, random_state=None,
             topic_word_prior=None, total_samples=1000000.0, verbose=0),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'n_topics': [10, 15, 20, 25], 'learning_decay': [0.5, 0.7]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)



As a result, we obtain an optimized model with learning_decay of 0.5, and with n_components equal to 10.

In [68]:
# Best Model
best_lda_model = model.best_estimator_
# Model Parameters
print("Best Model's Params: ", model.best_params_)
# Log Likelihood Score
print("Best Log Likelihood Score: ", model.best_score_)
# Perplexity
print("Model Perplexity: ", best_lda_model.perplexity(data_vectorized))

Best Model's Params:  {'learning_decay': 0.5, 'n_components': 10}
Best Log Likelihood Score:  -257136.9388205867
Model Perplexity:  1273.989773619191
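
Note that `best_score_` is not computed on the full corpus: GridSearchCV reports the mean score over the held-out folds of the cross-validation, which is why it differs in scale from the log-likelihood printed for the first model. The following sketch, on a made-up toy corpus and with an explicit `cv=3` and a shifted `learning_decay` grid (both are assumptions, the notebook relies on the defaults and on 0.5 and 0.7), shows how the score of each parameter combination can be inspected through `cv_results_`.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV

# Hypothetical toy corpus, repeated so that every fold holds a few documents.
toy_docs = [
    "graph vertices edges proof bounds",
    "social urban territorial public policy",
    "model data analysis approach results",
    "graph algorithms vertices degree",
    "urban public actors territory innovation",
    "data model systems time analysis",
] * 2
toy_counts = CountVectorizer().fit_transform(toy_docs)

toy_search = GridSearchCV(
    LatentDirichletAllocation(learning_method="online", max_iter=5, random_state=0),
    param_grid={"n_components": [2, 3], "learning_decay": [0.7, 0.9]},
    cv=3,  # explicit number of folds
)
toy_search.fit(toy_counts)

# One mean held-out log-likelihood per parameter combination.
for params, mean_score in zip(toy_search.cv_results_["params"],
                              toy_search.cv_results_["mean_test_score"]):
    print(params, round(float(mean_score), 2))
print("best:", toy_search.best_params_)
```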


Next, we compute the contribution of each topic to each abstract. The topic with the highest weight in a row is that abstract's dominant topic.

In [80]:
# Create Document — Topic Matrix
lda_output = best_lda_model.transform(data_vectorized)
topicnames = ["Topic" + str(i) for i in range(best_lda_model.n_components)]
docnames = ["Doc" + str(i) for i in range(len(data))]
df_document_topic = pd.DataFrame(np.round(lda_output, 2), columns=topicnames, index=texts['docid'])

dominant_topic = np.argmax(df_document_topic.values, axis=1)
df_document_topic['dominant_topic'] = dominant_topic

# Styling helpers: show weights above 0.1 in red and bold
def color_red(val):
    color = 'red' if val > .1 else 'black'
    return 'color: {col}'.format(col=color)
def make_bold(val):
    weight = 700 if val > .1 else 400
    return 'font-weight: {weight}'.format(weight=weight)

df_document_topics = df_document_topic.head(15).style.applymap(color_red).applymap(make_bold)
df_document_topics


Unnamed: 0_level_0,Topic0,Topic1,Topic2,Topic3,Topic4,Topic5,Topic6,Topic7,Topic8,Topic9,dominant_topic
docid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
171064,0.02,0.26,0.0,0.04,0.0,0.0,0.0,0.4,0.0,0.28,7
171587,0.0,0.0,0.0,0.0,0.0,0.73,0.0,0.11,0.0,0.15,5
187631,0.0,0.17,0.0,0.0,0.0,0.0,0.0,0.82,0.0,0.0,7
302871,0.0,0.0,0.0,0.0,0.02,0.21,0.0,0.76,0.0,0.0,7
337717,0.0,0.0,0.01,0.06,0.0,0.33,0.0,0.59,0.0,0.0,7
339479,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.93,0.0,0.0,7
350097,0.0,0.03,0.0,0.02,0.38,0.0,0.0,0.37,0.0,0.19,4
351931,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.0,0.95,9
351935,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.94,9
351946,0.0,0.54,0.0,0.12,0.0,0.0,0.04,0.0,0.0,0.3,1
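
Taking the argmax reduces each mixture to a single label, which can hide close calls. As a small illustration using two rows of the table above (docids 171064 and 350097), the sketch below reports the dominant topic together with its margin over the runner-up. The `dominant_with_margin` helper is hypothetical and not part of the original analysis.

```python
def dominant_with_margin(weights):
    """Return (dominant topic index, weight gap to the second-strongest topic)."""
    ranked = sorted(range(len(weights)), key=lambda k: weights[k], reverse=True)
    best, second = ranked[0], ranked[1]
    return best, round(weights[best] - weights[second], 2)

# Topic weights copied from the table above (Topic0 ... Topic9).
doc_topic = {
    171064: [0.02, 0.26, 0.0, 0.04, 0.0, 0.0, 0.0, 0.4, 0.0, 0.28],
    350097: [0.0, 0.03, 0.0, 0.02, 0.38, 0.0, 0.0, 0.37, 0.0, 0.19],
}
for docid, weights in doc_topic.items():
    print(docid, dominant_with_margin(weights))
```

For docid 350097 the gap between topic 4 and topic 7 is only 0.01, so its dominant-topic label should be read with some caution.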


In [70]:
# Number of abstracts per dominant topic
df_topic_distribution = df_document_topic['dominant_topic'].value_counts().reset_index(name="Num Documents")
df_topic_distribution.columns = ['Topic Num', 'Num Documents']
df_topic_distribution

Unnamed: 0,Topic Num,Num Documents
0,7,722
1,4,260
2,5,218
3,9,210
4,8,128
5,1,77
6,0,42
7,3,20
8,6,9
9,2,2


In [71]:
# Topic-Keyword Matrix
df_topic_keywords = pd.DataFrame(best_lda_model.components_)
# Assign Column and Index
df_topic_keywords.columns = vectorizer.get_feature_names()
df_topic_keywords.index = topicnames
# View
df_topic_keywords

Unnamed: 0,10,100,11,12,13,14,15,16,17,20,...,would,writing,written,year,years,yet,yield,yields,zero,zone
Topic0,19.253489,0.100519,39.358496,23.000299,0.102537,0.100542,0.100638,5.029134,0.173658,0.156991,...,6.711733,0.100713,0.101459,0.100693,18.948746,0.100703,0.10052,0.100432,0.100295,0.100258
Topic1,4.981478,0.100268,0.100572,0.100599,0.100547,3.090258,0.128208,0.105792,0.100524,2.230243,...,9.790795,8.033612,24.471779,0.28856,10.679981,2.837592,0.100285,0.108367,0.138167,0.100339
Topic2,0.100448,0.100225,0.100315,0.100302,0.100314,0.100267,0.100218,0.100256,0.100292,0.100233,...,0.102198,0.100225,0.100245,0.10024,0.10365,0.100378,0.100244,0.100256,0.100217,0.100294
Topic3,11.094409,5.719707,0.104077,0.101177,0.100338,0.100464,0.100934,0.100358,0.100285,0.101963,...,0.100471,0.100229,0.100276,0.108547,0.100622,0.10167,0.100351,0.100292,0.100375,0.101466
Topic4,3.144752,0.100335,0.100949,0.102151,0.10041,3.77351,0.101309,2.272642,0.102331,0.112542,...,10.856304,0.106653,0.106625,14.488151,43.597675,5.535585,0.100245,0.100384,0.100311,0.100624
Topic5,0.382121,1.830693,0.208615,0.170154,4.592629,0.224023,4.55282,4.82712,0.1006,0.100903,...,0.101329,0.101016,9.197653,0.100277,0.103996,0.100642,4.569602,3.794715,9.102056,0.115394
Topic6,0.10112,0.100233,0.100241,0.100627,0.100237,0.100258,0.100292,0.100346,0.100254,0.101123,...,0.10039,0.100275,0.10028,0.100243,0.100341,0.100347,0.100275,0.100254,0.100261,0.100328
Topic7,5.354598,0.279836,1.977993,8.872853,1.711987,0.109805,5.058288,3.770538,12.927741,0.185878,...,22.323274,3.588844,0.457508,4.44634,28.642756,22.802565,0.105799,7.681828,4.858284,0.288348
Topic8,2.045564,0.101575,3.087009,0.10797,0.114939,1.339929,5.771012,2.42373,0.103254,0.100447,...,0.102094,0.100281,0.104725,0.163166,0.115173,0.102776,0.100321,0.101448,0.101215,1.441432
Topic9,31.427789,21.895025,0.125102,6.277795,13.04709,7.380687,10.775451,4.245344,3.28738,23.323285,...,0.104439,0.100512,0.100597,0.101069,0.107657,0.101168,15.175,2.182275,0.172567,20.214408


To show which textual terms characterise each topic, we list the top 15 words per topic.

In [72]:
# Show top n keywords for each topic
def show_topics(vectorizer=vectorizer, lda_model=lda_model, n_words=10):
    keywords = np.array(vectorizer.get_feature_names())
    topic_keywords = []
    for topic_weights in lda_model.components_:
        top_keyword_locs = (-topic_weights).argsort()[:n_words]
        topic_keywords.append(keywords.take(top_keyword_locs))
    return topic_keywords
topic_keywords = show_topics(vectorizer=vectorizer, lda_model=best_lda_model, n_words=15)
# Topic - Keywords Dataframe
df_topic_keywords = pd.DataFrame(topic_keywords)
df_topic_keywords.columns = ['Word '+str(i) for i in range(df_topic_keywords.shape[1])]
df_topic_keywords.index = ['Topic '+str(i) for i in range(df_topic_keywords.shape[0])]
df_topic_keywords

Unnamed: 0,Word 0,Word 1,Word 2,Word 3,Word 4,Word 5,Word 6,Word 7,Word 8,Word 9,Word 10,Word 11,Word 12,Word 13,Word 14
Topic 0,speech,nanotubes,exposure,study,nanomaterials,metal,nano,tio2,carbon,nanoparticles,vocal,production,acoustic,products,potential
Topic 1,language,french,text,learning,study,corpus,information,languages,analysis,could,present,results,play,search,teaching
Topic 2,risks,risk,services,service,cloud,actions,mobility,towards,uncertainties,protocols,protocol,uses,oriented,daily,responsible
Topic 3,surface,frequency,waves,wave,absorption,hybrid,temperature,spectroscopy,large,si,molecules,interactions,quantum,single,origin
Topic 4,social,research,european,new,urban,territorial,local,public,management,also,political,economic,actors,territory,innovation
Topic 5,method,model,paper,based,problem,approach,using,algorithm,time,system,image,linear,systems,control,space
Topic 6,graph,graphs,number,minimal,degree,rewriting,proof,algorithms,contains,edges,bounds,set,prove,vertices,every
Topic 7,data,based,paper,system,new,model,approach,systems,time,used,analysis,different,models,information,results
Topic 8,de,et,la,des,les,le,une,en,un,dans,du,nous,sur,es,est
Topic 9,high,low,power,results,using,experimental,time,present,properties,used,flow,size,based,design,paper


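Based on these keywords, the ten topics can be labelled "Nanomaterials/Nanotubes/products", "Languages/text/analysis", "Mobility/risks/services", "Molecules/temperature/frequency", "Social/political/research", "System/method/paper", "Graphs/edges/bound", "data/analysis/models", "de/et/une" and "base/results/design/properties". Topic 8 ("de/et/une") consists of French function words, which suggests that some abstracts in the English field are actually written in French.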

In [81]:
Topics = ["Nanomaterials/Nanotubes/products","Languages/text/analysis","Mobility/risks/services","Molecules/temperature/frequency","Social/political/research", 
          "System/method/paper", "Graphs/edges/bound", "data/analysis/models", "de/et/une", "base/results/design/properties"]
df_topic_keywords["Topics"]=Topics
df_topic_keywords

Unnamed: 0,Word 0,Word 1,Word 2,Word 3,Word 4,Word 5,Word 6,Word 7,Word 8,Word 9,Word 10,Word 11,Word 12,Word 13,Word 14,Topics
Topic 0,speech,nanotubes,exposure,study,nanomaterials,metal,nano,tio2,carbon,nanoparticles,vocal,production,acoustic,products,potential,Nanomaterials/Nanotubes/products
Topic 1,language,french,text,learning,study,corpus,information,languages,analysis,could,present,results,play,search,teaching,Languages/text/analysis
Topic 2,risks,risk,services,service,cloud,actions,mobility,towards,uncertainties,protocols,protocol,uses,oriented,daily,responsible,Mobility/risks/services
Topic 3,surface,frequency,waves,wave,absorption,hybrid,temperature,spectroscopy,large,si,molecules,interactions,quantum,single,origin,Molecules/temperature/frequency
Topic 4,social,research,european,new,urban,territorial,local,public,management,also,political,economic,actors,territory,innovation,Social/political/research
Topic 5,method,model,paper,based,problem,approach,using,algorithm,time,system,image,linear,systems,control,space,System/method/paper
Topic 6,graph,graphs,number,minimal,degree,rewriting,proof,algorithms,contains,edges,bounds,set,prove,vertices,every,Graphs/edges/bound
Topic 7,data,based,paper,system,new,model,approach,systems,time,used,analysis,different,models,information,results,data/analysis/models
Topic 8,de,et,la,des,les,le,une,en,un,dans,du,nous,sur,es,est,de/et/une
Topic 9,high,low,power,results,using,experimental,time,present,properties,used,flow,size,based,design,paper,base/results/design/properties


We can now check how many publications fall into each cluster. The predominant topic is topic 7 ("data/analysis/models"), followed by topic 4 ("Social/political/research") and topic 5 ("System/method/paper").

In [97]:
df_topic_distribution = df_document_topic['dominant_topic'].value_counts().reset_index(name="Num Abstracts")
df_topic_distribution.columns = ['Topic Num', 'Num Abstracts']
df_topic_distribution['Topic']=df_topic_distribution['Topic Num']
d2m={"Topic":{ 0: "Nanomaterials/Nanotubes/products",1:"Languages/text/analysis",2: "Mobility/risks/services",\
               3:"Molecules/temperature/frequency",4:"Social/political/research", 5: "System/method/paper", \
               6:"Graphs/edges/bound",7: "data/analysis/models", 8:"de/et/une", 9:"base/results/design/properties"}}
df_topic_distribution.replace(d2m,inplace=True)
df_topic_distribution

Unnamed: 0,Topic Num,Num Abstracts,Topic
0,7,722,data/analysis/models
1,4,260,Social/political/research
2,5,218,System/method/paper
3,9,210,base/results/design/properties
4,8,128,de/et/une
5,1,77,Languages/text/analysis
6,0,42,Nanomaterials/Nanotubes/products
7,3,20,Molecules/temperature/frequency
8,6,9,Graphs/edges/bound
9,2,2,Mobility/risks/services


pyLDAvis offers an interactive view of the topic-keyword distributions.

In [75]:
pyLDAvis.enable_notebook()
panel = pyLDAvis.sklearn.prepare(best_lda_model, data_vectorized, vectorizer, mds='tsne')
panel

In [98]:
# reset_index() returns a new frame, so its result is passed to the merge to expose docid as a column
total=pd.merge(long3,df_document_topic.reset_index(),how="left",on='docid',indicator=True)
total = total[(total['_merge']=='both')]
total.drop(columns=['_merge'],inplace=True)
total['topic']=total['dominant_topic']
total.head()

Unnamed: 0,docid,anio,i,t,domain,Topic0,Topic1,Topic2,Topic3,Topic4,Topic5,Topic6,Topic7,Topic8,Topic9,dominant_topic,topic
29,171064,2009,Université Paris Diderot - Paris 7,institution,Physique,0.02,0.26,0.0,0.04,0.0,0.0,0.0,0.4,0.0,0.28,7.0,7.0
30,171064,2009,Université Paris Diderot - Paris 7,institution,Planète et Univers,0.02,0.26,0.0,0.04,0.0,0.0,0.0,0.4,0.0,0.28,7.0,7.0
31,171064,2009,Institut national des sciences de l'Univers,institution,Physique,0.02,0.26,0.0,0.04,0.0,0.0,0.0,0.4,0.0,0.28,7.0,7.0
32,171064,2009,Institut national des sciences de l'Univers,institution,Planète et Univers,0.02,0.26,0.0,0.04,0.0,0.0,0.0,0.4,0.0,0.28,7.0,7.0
33,171064,2009,Institut de Physique Nucléaire de Lyon,laboratory,Physique,0.02,0.26,0.0,0.04,0.0,0.0,0.0,0.4,0.0,0.28,7.0,7.0


In [101]:
texts_instit=total.groupby(['dominant_topic','i','t'])['docid'].agg('count').reset_index().sort_values(['dominant_topic','docid'],ascending=False)

#### 5b. analyse which local actors are producing publications in each of the clusters.

The next table shows the top two institutions for each dominant topic, ranked by number of publications. The Centre National de la Recherche Scientifique and Université Joseph Fourier lead both topic 9 (base/results/design/properties) and topic 7 (data/analysis/models); for both institutions, topic 7 accounts for the largest number of publications (532 and 198, respectively). In contrast, Université Claude Bernard appears mainly in the nanomaterials and nanotubes topic.

In [104]:
df1 = texts_instit.sort_values('docid',ascending = False).groupby('dominant_topic').head(2)
df1.rename(columns={'i':'institution','docid': 'num publications'},inplace=True)

df1['Topic']=df1['dominant_topic']
d2m={"Topic":{ 0: "Nanomaterials/Nanotubes/products",1:"Languages/text/analysis",2: "Mobility/risks/services",\
               3:"Molecules/temperature/frequency",4:"Social/political/research", 5: "System/method/paper", \
               6:"Graphs/edges/bound",7: "data/analysis/models", 8:"de/et/une", 9:"base/results/design/properties"}}
df1.replace(d2m,inplace=True)


df1.sort_values('dominant_topic',ascending = False)

Unnamed: 0,dominant_topic,institution,t,num publications,Topic
4428,9.0,Université Joseph Fourier - Grenoble 1,institution,78,base/results/design/properties
4011,9.0,Centre National de la Recherche Scientifique,institution,208,base/results/design/properties
3546,8.0,Centre National de la Recherche Scientifique,institution,65,de/et/une
3871,8.0,Université de Lyon,regroupinstitution,21,de/et/une
2064,7.0,Centre National de la Recherche Scientifique,institution,532,data/analysis/models
3249,7.0,Université Joseph Fourier - Grenoble 1,institution,198,data/analysis/models
1726,6.0,Université de Montpellier,institution,5,Graphs/edges/bound
1713,6.0,Laboratoire d'Informatique de Robotique et de ...,institution,6,Graphs/edges/bound
1163,5.0,Centre National de la Recherche Scientifique,institution,179,System/method/paper
1595,5.0,Université Joseph Fourier - Grenoble 1,institution,86,System/method/paper
