# Data Wrangling with Python
---
---
The Process:
1. [Import dependencies and configure settings](#Step-1:-Import-dependencies-and-configure-settings)
2. [Load the data into python](#Step-2:-Load-the-data-into-python)
3. [Create data views or new datasets by making more python objects](#Step-3:-Create-data-views-or-new-datasets-by-making-more-python-objects)
4. [Lock down the process and store the final result in a reliable location](#Step-4:-Lock-down-the-process-and-store-the-final-result-in-a-reliable-location)

![Lego-Data](../images/data_wrangling.jpg "Lego Data Wrangling")

---

## Step 1: Import dependencies and configure settings
### Import packages, modules, objects, or functions into the current notebook

***TIP: Avoid 'from PACKAGE import \*' syntax as it clutters your namespace with objects you may not be aware of***
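
To make the tip concrete, the sketch below compares the two import styles using the built-in `string` package. The `exec` calls and the two dictionaries are only a device for inspecting each namespace in isolation; they are not a pattern to copy into a real notebook.

```python
# Run each import style in its own empty namespace (a plain dict) and compare.
star_namespace = {}
exec("from string import *", star_namespace)

explicit_namespace = {}
exec("from string import punctuation", explicit_namespace)

# exec() adds '__builtins__' on its own, so ignore dunder names.
star_names = sorted(name for name in star_namespace if not name.startswith("__"))
explicit_names = sorted(name for name in explicit_namespace if not name.startswith("__"))

print(len(star_names), "names imported with *:", star_names)
print(len(explicit_names), "name imported explicitly:", explicit_names)
```

The star import pulls in every public name the package exposes, while the explicit import adds only the one name that was asked for.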

In [None]:
# BEST PRACTICE: import built-in packages first
# from PACKAGE import OBJECT lets us bring only what we need into our namespace
from string import punctuation
from warnings import filterwarnings
# import PACKAGE binds the whole package to one object named PACKAGE
import re

# BEST PRACTICE: import third-party packages second
# from PACKAGE import OBJECT as ALIAS brings in a single object under the name ALIAS
from contractions import fix as contractions_fix
from gensim.corpora import Dictionary as gensim_Dictionary
from gensim.models import Phrases as gensim_Phrases
from gensim.models.phrases import Phraser as gensim_Phraser
from geopy.geocoders import Nominatim as geopy_Nominatim
from joblib import dump as joblib_dump
from nltk import pos_tag as nltk_pos_tag
from nltk.tokenize import word_tokenize as nltk_word_tokenize
from nltk.corpus import (
    # use parentheses to import multiple objects and even add aliases
    stopwords as nltk_stopwords,
    wordnet as nltk_wordnet
)
from nltk.stem import (
    SnowballStemmer as nltk_SnowballStemmer,
    WordNetLemmatizer as nltk_WordNetLemmatizer
)
from sklearn.feature_extraction.text import (
    CountVectorizer as sklearn_CountVectorizer,
    TfidfTransformer as sklearn_TfidfTransformer,
    TfidfVectorizer as sklearn_TfidfVectorizer
)
# aliases can be added when fully importing packages
import bamboolib as bam
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy as sp
# aliases can also be used when importing an individual module from the package
import statsmodels.api as sm

# BEST PRACTICE: import custom packages last
# for example: import custom_module as cm

### Configure settings by using functions or methods (always check the documentation)
Third-party packages used in this notebook:
- [Bamboolib](https://docs.bamboolib.8080labs.com/documentation/getting-started)
- [Contractions](https://github.com/kootenpv/contractions)
- [Gensim](https://radimrehurek.com/gensim/)
- [Geopy](https://github.com/geopy/geopy)
- [Matplotlib](https://matplotlib.org/)
- [NLTK](https://www.nltk.org/)
- [NumPy](https://numpy.org/doc/stable/user/index.html)
- [Pandas](https://pandas.pydata.org/pandas-docs/version/1.2.4/user_guide/index.html)
- [pyLDAvis](https://github.com/bmabey/pyLDAvis)
- [Scikit Learn](https://scikit-learn.org/stable/)
- [SciPy](https://docs.scipy.org/doc/scipy/tutorial/index.html)
- [Statsmodels](https://www.statsmodels.org/devel/user-guide.html)

In [None]:
# tell Python to filter out warnings so they do not clutter cell outputs
filterwarnings(
    'ignore'
)
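
Silencing every warning for the whole session can hide useful messages. As an alternative sketch (not what this notebook does), the standard library's `warnings.catch_warnings()` context manager limits the filter to one block of code. The `legacy_helper` function below is hypothetical and exists only to emit a warning.

```python
import warnings

def legacy_helper():
    # hypothetical function that emits a warning, used only for illustration
    warnings.warn("legacy_helper is deprecated", DeprecationWarning)
    return 42

# warnings are silenced only inside this block; the previous filters are restored afterwards
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    quiet_result = legacy_helper()

# record=True collects warnings in a list instead of printing them
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy_helper()

print(quiet_result, len(caught), caught[0].category.__name__)
```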

In [None]:
# set custom pandas package options by iterating through key-value pairs in a dictionary
for option, value in { # dictionaries are denoted by curly brackets {} or the dict() function
    'display.max_columns': 50,
    'display.max_colwidth': None,
    'display.max_info_columns': 50,
    'display.max_rows': 20,
    'display.precision': 4
}.items(): # the .items() method of a dictionary lets us iterate through key-value pairs
    # we can call a function on the variable we set for the objects we're iterating over
    # in this case those variables are 'option', and 'value' and they represent
    # the key, value pairs from the dictionary above
    pd.set_option(
        option, # this will be 'display.max_columns' etc..
        value   # this will be 50, None, etc..
    )
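
Options set with `pd.set_option` stay in effect for the rest of the session. When a setting is only needed for one output, `pd.option_context` is a possible alternative; the sketch below uses arbitrary values purely for illustration.

```python
import pandas as pd

# remember the current value so we can confirm it is restored afterwards
rows_before = pd.get_option('display.max_rows')

# options passed to option_context apply only inside the 'with' block
with pd.option_context('display.max_rows', 5, 'display.precision', 2):
    rows_inside = pd.get_option('display.max_rows')

rows_after = pd.get_option('display.max_rows')
print(rows_before, rows_inside, rows_after)
```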

[Top](#Data-Wrangling-with-Python)

---

## Step 2: Load the data into python
### Use tools that 'open' files and load their contents into objects
NOTE: It is possible to build custom functions to read files, but this is not advised because many Python libraries already do so with better performance and additional functionality

In [None]:
# load each dataset into a pandas DataFrame object
cocoon_pharmacy_df = pd.read_csv(
    # the read_csv function from the pandas package will read the file at this location
    '../data/cocoon_pharmacy_location_added.csv',
    # this .csv was saved from a dataframe with the index in the first position
    index_col = 0
)
data_literacy_df = pd.read_csv(
    # the '..' notation indicates that parsing should begin at the parent folder 
    # of this notebook, so if this notebook is in the 'notebooks/' folder then
    # '..' would map to the root folder of this repository (the parent of 'notebooks/') 
    '../data/data_literacy_questionnaire.csv'
)
data_journey_df = pd.read_csv(
    '../data/data_journey_questionnaire.csv'
)
meeting_cadence_df = pd.read_csv(
    '../data/meeting_cadence_survey.csv'
)
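
To see what `index_col = 0` changes, the sketch below writes a small, invented DataFrame to an in-memory buffer (standing in for a file on disk) and reads it back both ways.

```python
import io
import pandas as pd

# invented data; to_csv writes the index as the first, unnamed column by default
original = pd.DataFrame({'product': ['soap', 'lotion'], 'rating': [4, 5]})
buffer = io.StringIO()
original.to_csv(buffer)

# without index_col the saved index comes back as an extra data column
buffer.seek(0)
without_index_col = pd.read_csv(buffer)

# with index_col=0 the first column is used as the index again
buffer.seek(0)
with_index_col = pd.read_csv(buffer, index_col=0)

print(list(without_index_col.columns))
print(list(with_index_col.columns))
```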

[Top](#Data-Wrangling-with-Python)

---

## Step 3: Create data views or new datasets by making more python objects

### Option 1: Use third-party tools for transforming data
Bamboolib is our recommendation because it offers a great bridge into Python, but any low-code or no-code solution is a great place to get started with these concepts. Many of them, such as JMP from SAS, Minitab, and Power BI, also offer scripting integrations.

**Getting Started with [Bamboolib](https://docs.bamboolib.8080labs.com/documentation/getting-started):**
- [x] import the package *(completed at the beginning of this notebook)*

- [x] create a Pandas DataFrame *(completed in the code cell above)*

- [ ] display a Pandas DataFrame to make the "Show Bamboolib UI" button available

***TIP: DataFrames can be displayed by using the display() function or by placing the object on the last line of a cell***

In [None]:
# NOTE: Bamboolib will overwrite the contents of this cell after work is complete 
# with any code necessary to execute the requested transformations or plots
cocoon_pharmacy_df

In [None]:
# NOTE: Bamboolib will overwrite the contents of this cell after work is complete 
# with any code necessary to execute the requested transformations or plots
data_literacy_df

In [None]:
# NOTE: Bamboolib will overwrite the contents of this cell after work is complete 
# with any code necessary to execute the requested transformations or plots
data_journey_df

In [None]:
# NOTE: Bamboolib will overwrite the contents of this cell after work is complete 
# with any code necessary to execute the requested transformations or plots
meeting_cadence_df

### Option 2: Write code to transform the data
Use the many packages available in Python to build flexible 'templated' solutions that address recurring issues or dynamic problems that require more complex integrations and/or additional capabilities *(Ex. networking, containerization/virtualization, multi-processing)*

**Getting Started with writing 'Production' code:**
- [x] Adhere to a consistent method for structuring and writing code *(see [PEP8](https://peps.python.org/pep-0008/) for more details)*

- [x] Document third-party packages and the versions being used *(Ex. binder/requirements.txt file in this repository)*

- [ ] Always test your code! *(Doesn't have to be automated, but 'good' code should always have error handling and logging)*
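
As a minimal sketch of the last checklist item, the function below combines error handling, logging, and two manual checks. The function name, logger name, and messages are hypothetical and only illustrate the idea.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("wrangling")  # hypothetical logger name

def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    try:
        return numerator / denominator
    except ZeroDivisionError:
        # recoverable problem: log it and return a sentinel value
        logger.warning("denominator was zero; returning None")
        return None
    except TypeError:
        # unrecoverable problem: log it and let the caller see the error
        logger.error("inputs must be numbers, got %r and %r", numerator, denominator)
        raise

# simple manual tests: one normal case and one edge case
assert safe_ratio(6, 3) == 2
assert safe_ratio(1, 0) is None
```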

***TIP: Check out the [Official Pandas Cheat Sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf) for a Quick Syntax Reference***

In [None]:
# keep only the rows whose product category is 'Body Moisturisers'
body_moisturizers = cocoon_pharmacy_df[cocoon_pharmacy_df['product_cat'] == 'Body Moisturisers']
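
The filter above relies on a boolean mask. The sketch below shows the mechanics on a toy DataFrame; the rows and category labels are invented, and only the `product_cat` column name comes from this notebook. Listing the labels with `value_counts()` first is a cheap way to catch a misspelled category, which would otherwise silently return zero rows.

```python
import pandas as pd

# toy stand-in for cocoon_pharmacy_df; the rows and category labels are invented
toy_df = pd.DataFrame({
    'product_cat': ['Cat A', 'Cat B', 'Cat A', 'Cat C'],
    'rating': [5, 3, 4, 2],
})

# list the labels that actually occur before filtering on one of them
print(toy_df['product_cat'].value_counts())

# the comparison yields a boolean Series with one True/False per row
mask = toy_df['product_cat'] == 'Cat A'

# indexing with the mask keeps only the rows where it is True
cat_a = toy_df[mask]

# .isin() builds the same kind of mask for several labels at once
cat_a_or_c = toy_df[toy_df['product_cat'].isin(['Cat A', 'Cat C'])]
```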

In [None]:
def get_wordnet_pos(treebank_tag):
    # map a Penn Treebank tag to the WordNet part of speech used by the lemmatizer
    if treebank_tag.startswith('J'):
        return nltk_wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return nltk_wordnet.VERB
    elif treebank_tag.startswith('R'):
        return nltk_wordnet.ADV
    # nouns ('N...') and any unrecognized tag fall back to NOUN
    return nltk_wordnet.NOUN

def first_preprocessing(pdf):
    stopwords = set(nltk_stopwords.words('english'))
    lemmatizer = nltk_WordNetLemmatizer() # create once instead of once per word

    # work on a copy so the original DataFrame is left untouched
    eng = pdf.copy(deep = True)

    # lower case, expand contractions, and strip spaces
    eng['body_review_cleaned'] = eng['body_review'].apply(lambda x: contractions_fix(x.lower().strip()))
    # remove any word containing a digit
    eng['body_review_cleaned'] = eng['body_review_cleaned'].apply(lambda x: ' '.join(s for s in x.split() if not any(c.isdigit() for c in s)))
    # remove punctuation
    eng['body_review_cleaned'] = eng['body_review_cleaned'].apply(lambda x: ''.join(char for char in x if char not in punctuation))
    # remove stopwords
    eng['body_review_cleaned'] = eng['body_review_cleaned'].apply(lambda x: ' '.join(w for w in x.split() if w not in stopwords))
    # remove straight and curly quotes (curly quotes are not part of string.punctuation)
    eng['body_review_cleaned'] = eng['body_review_cleaned'].apply(lambda x: re.sub(r'["\'“”]', '', x))
    # lemmatize each word using its part-of-speech tag
    eng['body_review_cleaned'] = eng['body_review_cleaned'].apply(
        lambda x: ' '.join(lemmatizer.lemmatize(word, get_wordnet_pos(tag)) for word, tag in nltk_pos_tag(x.split()))
    )

    sentence_stream = [doc.split(" ") for doc in eng['body_review_cleaned'].values]

    # a higher threshold yields fewer phrases
    bigram = gensim_Phrases(sentence_stream, min_count=15, threshold=5)
    trigram = gensim_Phrases(bigram[sentence_stream], min_count=15, threshold=5)

    # Phraser objects are a faster way to apply the trained bigram/trigram models
    bigram_mod = gensim_Phraser(bigram)
    trigram_mod = gensim_Phraser(trigram)

    return pdf, eng, sentence_stream, bigram_mod, trigram_mod

def later_preprocessing(text, bigram_mod, trigram_mod):
    # merge frequent word pairs, then frequent triples, into single tokens;
    # the bigram model is applied once, matching how the trigram model was trained
    tokens = trigram_mod[bigram_mod[text.split()]]
    # remove short words (two characters or fewer, which already covers 'no' and 'qc')
    return ' '.join(w.strip() for w in tokens if len(w.strip()) > 2)
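
The first few cleaning steps in `first_preprocessing` can be followed on a single string. The sketch below uses only the standard library, so contraction expansion and lemmatization are left out, and the sample sentence and the tiny stopword set are invented for illustration.

```python
from string import punctuation

def basic_clean(text, stopwords):
    # lower case and strip surrounding spaces
    text = text.lower().strip()
    # drop any word that contains a digit
    text = ' '.join(tok for tok in text.split() if not any(c.isdigit() for c in tok))
    # drop punctuation characters
    text = ''.join(ch for ch in text if ch not in punctuation)
    # drop stopwords
    return ' '.join(w for w in text.split() if w not in stopwords)

sample = "  The 2nd bottle was GREAT, really!  "
cleaned = basic_clean(sample, {'the', 'was'})
print(cleaned)  # bottle great really
```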

In [None]:
pdf, eng, sentence_stream, bigram_mod, trigram_mod = first_preprocessing(cocoon_pharmacy_df)
eng['body_review_cleaned'] = eng['body_review_cleaned'].apply(lambda x: later_preprocessing(x,bigram_mod, trigram_mod))

In [None]:
sentence_streams = [doc.split(" ") for doc in eng['body_review_cleaned'].values]
id2word = gensim_Dictionary(sentence_streams)
id2word.filter_extremes(no_below=7, no_above=0.95, keep_n=25000)
corpus = [id2word.doc2bow(text) for text in sentence_streams]
print('Total number of unique words after filtering extremes: {}'.format(len(id2word)))
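
The cell above builds a vocabulary, filters out very rare and very common words, and turns each document into (token id, count) pairs. The plain-Python sketch below illustrates that idea on invented documents. It is not how Gensim implements `filter_extremes` or `doc2bow`, and details such as id assignment and rounding may differ.

```python
from collections import Counter

# toy tokenized documents (invented for illustration)
docs = [
    ['soft', 'skin', 'cream'],
    ['soft', 'soft', 'cream'],
    ['sticky', 'cream'],
    ['soft', 'skin'],
]

# document frequency: in how many documents does each token appear?
doc_freq = Counter(token for doc in docs for token in set(doc))

# keep tokens found in at least `no_below` documents
# and in no more than `no_above` (a fraction) of all documents
no_below, no_above = 2, 0.8
kept = sorted(
    token for token, df in doc_freq.items()
    if df >= no_below and df / len(docs) <= no_above
)
token2id = {token: idx for idx, token in enumerate(kept)}

def to_bow(doc):
    # count only tokens that survived the filter, as (token id, count) pairs
    counts = Counter(token for token in doc if token in token2id)
    return sorted((token2id[token], n) for token, n in counts.items())

bow_corpus = [to_bow(doc) for doc in docs]
print(token2id)
print(bow_corpus)
```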

In [None]:
joblib_dump(
    corpus,
    '../data/cocoon_reviews_corpus.pkl'
)
joblib_dump(
    id2word,
    '../data/cocoon_reviews_id2word.pkl'
)
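
A quick way to confirm that a saved object is usable is to load it back and compare it with the original. The sketch below does this with `joblib.load` on a small invented corpus, writing to a temporary folder rather than to `../data/`.

```python
import os
import tempfile
from joblib import dump, load

# invented bag-of-words corpus: one list of (token id, count) pairs per document
toy_corpus = [[(0, 1), (2, 2)], [(1, 1)]]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_corpus.pkl')
    dump(toy_corpus, path)   # write the object to disk
    restored = load(path)    # read it back while the folder still exists

print(restored == toy_corpus)
```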

[Top](#Data-Wrangling-with-Python)

---

## Step 4: Lock down the process and store the final result in a reliable location

- Always save the data to a secure location, whether that's a table in a relational database, a flat file in an online file system, or a flat file on a physical drive that is consistently backed up *([See Pandas In/Out Methods](https://pandas.pydata.org/docs/reference/io.html))*

- Whether or not a low-code tool was used, if any scripting/programming has been built, always try to follow best practice by 'committing' it to an online [Git](https://git-scm.com/book/en/v2/Getting-Started-What-is-Git%3F) Repository so that it can be securely backed up with an audit trail *([GitHub](https://github.com/) and [Azure DevOps Services](https://azure.microsoft.com/en-us/services/devops/) are both free for small teams)*

- Make sure at any given time there is a good version of the code that 'always builds', meaning any future development work should occur on 'branches' (or copies) of the code before finalizing those changes on the good version *([see some methods for how to organize your development](https://medium.com/@patrickporto/4-branching-workflows-for-git-30d0aaee7bf))*

- There is no such thing as documentation that is 'too descriptive'! It's highly unlikely it will ever be too much for any prospective users/contributors of bespoke code, and it diminishes the probability that the writer will tragically forget how exactly their code works *([which happens to the best of us](https://javascript.plainenglish.io/is-it-normal-to-forget-how-your-own-code-works-e5a6462f3571))*
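
As a sketch of the first point above, the code below stores a DataFrame as a table in a relational database and reads it back. It uses an in-memory SQLite database so it runs anywhere; the table name and data are invented, and a real project would connect to a persistent, backed-up database instead.

```python
import sqlite3
import pandas as pd

# invented survey results
toy_df = pd.DataFrame({'respondent': [1, 2, 3], 'score': [4.5, 3.0, 5.0]})

# ':memory:' creates a temporary database that disappears when the connection closes
connection = sqlite3.connect(':memory:')

# write the DataFrame as a table, replacing it if it already exists
toy_df.to_sql('survey_scores', connection, index=False, if_exists='replace')

# read the table back to confirm what was stored
restored = pd.read_sql('SELECT * FROM survey_scores', connection)
connection.close()

print(restored)
```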

---
---