# Prep_400
This notebook prepares the data for my fourth research question, 'What Python packages are being imported together regularly?'

## Purpose
* In this notebook I will create lists of the packages imported within each notebook, to be used for apriori analysis
    * I will begin by reading in a dataframe of notebook info (including packages imported)
    * The package column will be cleaned and a dataframe with lists of imports will be created
    * This dataframe will then be transformed into a list of string lists
    * This list will then be saved to a txt file
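
To make the target format concrete, here is a minimal sketch of a list of per-notebook package lists and of why that shape suits co-occurrence analysis. The `transactions` data is invented for illustration, and the pair counting uses only the standard library as a stand-in for a real apriori implementation.

```python
from collections import Counter
from itertools import combinations

# Toy stand-in for the per-notebook import lists (hypothetical data).
transactions = [
    ["numpy", "pandas", "matplotlib.pyplot"],
    ["numpy", "pandas"],
    ["sympy"],
]

pair_counts = Counter()
for packages in transactions:
    # Sort and de-duplicate so each unordered pair is counted once per notebook.
    for pair in combinations(sorted(set(packages)), 2):
        pair_counts[pair] += 1

# Show the pair of packages that appears together most often.
print(pair_counts.most_common(1))
```

Each inner list plays the role of one "transaction", which is the input shape that apriori-style algorithms generally expect.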

In [1]:
#importing relevant libraries
import os
import json
import numpy as np
import pandas as pd
import seaborn as sns
import re
import time
import datetime

### Reading in dataframe

In [2]:
df_nb = pd.read_csv('../data/CSV_files/new_cell_info.csv')

In [3]:
df_nb.head()

Unnamed: 0,nb_id,nb_language,worksheet_index,cell_index,num_imports,imports,num_headings,headings,header_level
0,400000,julia,0,,1,"['DataFrames,']",0,[],[]
1,400000,julia,1,,0,[],0,[],[]
2,400000,julia,2,,0,[],0,[],[]
3,400000,julia,3,,0,[],0,[],[]
4,400000,julia,4,,0,[],0,[],[]


### Cleaning 'imports' column

In [4]:
# Strip the surrounding square brackets and the quote characters from each entry
df_nb['imports'] = df_nb['imports'].astype(str).str.strip('[]').str.replace("'", "", regex=False)

Removing the square brackets and quote characters from the imports column
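
As an alternative to stripping characters, the stringified lists could be parsed back into real Python lists. The sketch below assumes every cell is a valid Python list literal like the ones shown in `df_nb.head()`; `parse_imports` and the sample values are hypothetical and are not part of this notebook's pipeline.

```python
import ast

# Sample cell values in the same shape as the 'imports' column (illustrative only).
raw_cells = ["['DataFrames,']", "[]", "['numpy', 'os']"]

def parse_imports(cell):
    """Parse a stringified list and drop stray commas and whitespace."""
    names = ast.literal_eval(cell)
    cleaned = [name.strip().rstrip(",") for name in names]
    return [name for name in cleaned if name]

parsed = [parse_imports(cell) for cell in raw_cells]
print(parsed)
```

Parsing keeps each package as a separate list element, so the trailing comma seen in `'DataFrames,'` can be removed per name rather than handled later.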

In [5]:
df_imports = df_nb.replace('', np.nan) # replacing empty strings with NaN values

In [6]:
df_imports = df_imports.dropna(subset=['imports']) # dropping rows with no imports

In [7]:
df_imports.head()

Unnamed: 0,nb_id,nb_language,worksheet_index,cell_index,num_imports,imports,num_headings,headings,header_level
0,400000,julia,0,,1,"DataFrames,",0,[],[]
22,400002,python,0,,1,sympy,0,[],[]
41,400004,python,1,,6,"gcp, gcp.storage, gcp.context, random, pandas,...",0,[],[]
49,400004,python,9,,1,inspect,0,[],[]
60,400005,python,1,,9,"matplotlib.pyplot, numpy, os, tarfile, urllib,...",0,[],[]


In [8]:
df_imports = df_imports[['nb_id', 'imports']]

Creating dataframe with just imports and notebook ID 

In [9]:
foo = lambda a: ", ".join(a)
dfImports = df_imports.groupby('nb_id').agg({'imports': foo})

Joining the import lists by notebook ID so that the analysis can be done per notebook rather than per cell (i.e. some people import packages in multiple cells of a notebook).
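
The effect of this grouping step can be seen on a toy frame. This is a small sketch with invented notebook IDs and import strings; it passes `", ".join` directly to `agg` instead of wrapping it in a lambda, which should behave the same way.

```python
import pandas as pd

# Two cells from notebook 1 and one cell from notebook 2 (hypothetical data).
cells = pd.DataFrame({
    "nb_id": [1, 1, 2],
    "imports": ["numpy, os", "pandas", "sympy"],
})

# Concatenate the import strings of all cells that share a notebook ID.
per_notebook = cells.groupby("nb_id")["imports"].agg(", ".join)
print(per_notebook.to_dict())
```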

In [10]:
dfImports.head()

nb_id,imports
400000,"DataFrames,"
400002,sympy
400004,"gcp, gcp.storage, gcp.context, random, pandas,..."
400005,"matplotlib.pyplot, numpy, os, tarfile, urllib,..."
400006,"gensim, numpy, __future__, time, sklearn.featu..."


In [11]:
dfImports = dfImports['imports'].str.split(',', expand=True).reset_index()

Expanding the imports out to multiple columns.
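
A toy example helps show what `expand=True` produces: one column per position, padded with missing values for shorter rows, and with the leading space after each comma kept. The second half sketches a possible alternative that keeps one list per notebook and never builds the wide frame. The series values and IDs are invented for illustration.

```python
import pandas as pd

# Joined import strings indexed by hypothetical notebook IDs.
joined = pd.Series(["numpy, os, pandas", "sympy"], index=[400001, 400002])

# Wide form: one column per import position, shorter rows padded with missing values.
wide = joined.str.split(",", expand=True)
print(wide.shape)

# Alternative: one cleaned list per notebook, without the wide frame.
lists = joined.str.split(",").apply(
    lambda parts: [p.strip() for p in parts if p.strip()]
)
print(lists.tolist())
```

The wide form is as wide as the longest import list in the data, which is why the real frame ends up with several hundred mostly empty columns.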

In [12]:
dfImports.head()

Unnamed: 0,nb_id,0,1,2,3,4,5,6,7,8,...,294,295,296,297,298,299,300,301,302,303
0,400000,DataFrames,,,,,,,,,...,,,,,,,,,,
1,400002,sympy,,,,,,,,,...,,,,,,,,,,
2,400004,gcp,gcp.storage,gcp.context,random,pandas,StringIO,inspect,,,...,,,,,,,,,,
3,400005,matplotlib.pyplot,numpy,os,tarfile,urllib,IPython.display,scipy,sklearn.linear_model,cPickle,...,,,,,,,,,,
4,400006,gensim,numpy,__future__,time,sklearn.feature_extraction.text,sklearn.decomposition,sklearn.datasets,sklearn.datasets,gensim,...,,,,,,,,,,


In [13]:
dfImports.fillna(value=np.nan, inplace=True) # replacing None values with NaN (pd.np is no longer available in recent pandas)

In [14]:
dfImports = dfImports.drop(['nb_id'], axis = 1)  # dropping nb_id column

In [15]:
dfImports = dfImports.replace('nan', np.nan) # replacing the string 'nan' with NaN

In [16]:
dfImports = dfImports.fillna('') # replacing all NaN values with empty strings

Replacing NaN values with empty strings so that NaN is not counted as a package in future analysis

In [17]:
dfImports.shape

(127001, 304)

127,001 lists of packages

### Converting dataframe to string list

In [18]:
# Collect the unique cell values of each row as strings
values = dfImports.values  # fetch the array once instead of once per cell
records = [list(set(str(value) for value in row)) for row in values]

In [19]:
cleanedList = [[x for x in lst if x != 'nan'] for lst in records]

Again getting rid of any remaining 'nan' strings in each list

In [20]:
str_list = [list(filter(None, lst)) for lst in cleanedList]

In [21]:
result = [[s.strip() for s in inner] for inner in str_list]

Filtering out empty strings and stripping surrounding whitespace to make a cleaned list
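
The result displayed in the next cell still contains repeated names such as 'gensim' and 'os'. A likely cause is that duplicates are removed with `set` before the whitespace is stripped, so `'gensim'` and `' gensim'` count as different values. The sketch below strips first and de-duplicates afterwards; `clean_row` and the sample rows are hypothetical and are not the code used in this notebook.

```python
# Sample rows shaped like the expanded dataframe values (illustrative only).
rows = [
    ["gensim", " numpy", " gensim", "", ""],
    ["os", " os", ""],
]

def clean_row(row):
    """Strip whitespace, drop empty strings, and keep the first occurrence of each name."""
    seen = []
    for item in row:
        name = item.strip()
        if name and name not in seen:
            seen.append(name)
    return seen

cleaned = [clean_row(row) for row in rows]
print(cleaned)
```

Keeping the first occurrence also preserves the original import order, which a `set` does not guarantee.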

In [22]:
result

[['DataFrames'],
 ['sympy'],
 ['pandas',
  'gcp',
  'gcp.storage',
  'random',
  'inspect',
  'StringIO',
  'gcp.context'],
 ['urllib',
  'cPickle',
  'os',
  'numpy',
  'sklearn.linear_model',
  'scipy',
  'IPython.display',
  'tarfile',
  'matplotlib.pyplot'],
 ['gensim',
  'sklearn.feature_extraction.text',
  'collections',
  'sklearn.datasets',
  '__future__',
  'sklearn.decomposition',
  'numpy',
  'time',
  'gensim'],
 ['os',
  'matplotlib.pyplot',
  'lasagne',
  'os',
  'theano.tensor',
  'numpy',
  'theano',
  'lasagne.layers',
  'IPython.display',
  'gym',
  'gym.wrappers'],
 ['math'],
 ['pandas', 'datetime'],
 ['ThreeVector',
  'FutureColliderTools',
  'sys',
  'pickle',
  'numpy',
  'FourVector',
  'ROOT',
  'FutureColliderDataLoader'],
 ['pyspark.sql', 'pyspark.sql.types'],
 ['pandas',
  'random',
  'sklearn.neighbors',
  'matplotlib.pyplot',
  'sklearn.metrics',
  'sklearn.model_selection',
  'numpy',
  'sklearn',
  'sklearn.cross_validation'],
 ['pyspark.sql', 'pyspark.sq

## Saving list to text file

In [23]:
import pickle
# Note: despite the .txt extension, this file holds binary pickle data
with open("../data/Txt_files/apriori.txt", "wb") as fp:   # Pickling
    pickle.dump(result, fp)
with open("../data/Txt_files/apriori.txt", "rb") as fp:   # Unpickling
    Imports_apr = pickle.load(fp)
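
Because the saved file is a pickle, it can only be read back with `pickle.load`. If a human-readable file is wanted, one option is to write one notebook per line with comma-separated package names. The sketch below shows both round trips on toy data in a temporary directory. The file names are placeholders, and the plain-text format assumes package names contain no commas.

```python
import os
import pickle
import tempfile

# Toy stand-in for the cleaned list of per-notebook imports.
result = [["numpy", "pandas"], ["sympy"]]

with tempfile.TemporaryDirectory() as tmp:
    # Pickle round trip: binary file, restored as the same nested list.
    pkl_path = os.path.join(tmp, "apriori.pkl")
    with open(pkl_path, "wb") as fp:
        pickle.dump(result, fp)
    with open(pkl_path, "rb") as fp:
        restored = pickle.load(fp)

    # Plain-text round trip: one notebook per line, names separated by commas.
    txt_path = os.path.join(tmp, "apriori_plain.txt")
    with open(txt_path, "w", encoding="utf-8") as fp:
        for packages in result:
            fp.write(",".join(packages) + "\n")
    with open(txt_path, encoding="utf-8") as fp:
        from_text = [line.rstrip("\n").split(",") for line in fp]

print(restored == result, from_text == result)
```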