<h3 align="center"> Unsupervised Learning Capstone - Data Cleaning and Processing Only Notebook</h3> 

__Contents__<a name="top"></a>
1. [Dataset  Description and Notes](#describe)
1. [Objective](#object)
1. [Example XML Files](#xml)
2. [Cleaning and Processing Data](#clean)
 - [Process01 Function](#proc1)
 - [Process02 Function](#proc2)
 - [Process03 Function](#proc3)
3. [Create Dataframe and Save to File](#file)
4. [Unit Tests](#test)
5. [Scratch](#scratch)

### Dataset  Description and Notes <a name="describe"></a>

Here is the
The main reason I chose this dataset is that, beyond the text of the blog posts, there are four additional features (gender, age, occupation and sign) that can be targeted. I will admit that the idea of using a horoscope sign as a target intrigued me more than it should.

Notes on the dataset and how it was used:

- There are 19,322 xml files that vary greatly in size. Some lack a value for the occupation feature. While most of the xml files required cleaning before they could be parsed, some had to be discarded because they were nowhere near *"well formed"*.
- To develop a classifier model that targets the blogger, I used the 10 largest files.
- Separate classifier models that target the other four features are developed from a set of about 200 files, each around 25 KB in size. That seems to give a reasonable number of post samples for each blogger. I did not use any file that lacked an occupation value or had an age of 21 or less.
- To build that set, sort the xml files by size and select about 200 files in the 25 KB range that have a reasonable number of blog entries.
- Filter out files that don't have info for all features. Filter out bloggers aged 21 or less.
- Use *lxml.etree*, not xml.etree, because of decoding issues in the xml files.
- Some xml files still have decoding issues and are not used.
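
The size-based selection described above could be sketched roughly as follows. The function name `select_by_size` and the `target`, `tolerance` and `limit` values are assumptions made for illustration; the notebook only says "about 200 files in the 25 KB range".

```python
from pathlib import Path

def select_by_size(folder, target=25_000, tolerance=5_000, limit=200):
    """Return up to `limit` xml file names whose size is close to `target` bytes.

    All three thresholds are hypothetical defaults, not values from the notebook.
    """
    # pair each file with its distance from the target size
    near = []
    for path in Path(folder).glob('*.xml'):
        distance = abs(path.stat().st_size - target)
        if distance <= tolerance:
            near.append((distance, path.name))
    # closest files first, then keep at most `limit` names
    return [name for _, name in sorted(near)[:limit]]
```

Files that fail the age or occupation filters would still need to be dropped afterwards.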

*Please cite the following if you use the data:*

J. Schler, M. Koppel, S. Argamon and J. Pennebaker (2006). Effects of Age and Gender on Blogging in Proceedings of 2006 AAAI Spring Symposium on Computational Approaches for Analyzing Weblogs.

### Objective<a name="object"></a>

Create three sample sets as DataFrames:
 - "A" is from 10 of the most prolific bloggers, with approximately 8500 samples.
 - "B" is from approximately 200 bloggers, with approximately 4500 samples.

Create a DataFrame for exploring features and feature engineering:
 - "F" is "B" without the NLP features and one sample per blogger.

### Example XML File<a name="xml"></a>

The data is contained in xml files and the file name itself. An example file name is: 
- 3025353.male.35.Religion.Aquarius.xml
  - name: 3025353
  - gender : male
  - age : 35
  - occupation : Religion
  - sign: Aquarius
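
As a small sketch of how the file name maps to features, assuming every name has exactly six dot-separated parts ending in `xml` (the helper name `parse_file_name` is hypothetical; the dict keys follow the abbreviations used later in this notebook):

```python
def parse_file_name(file_name):
    """Split '<id>.<gender>.<age>.<occupation>.<sign>.xml' into a dict."""
    # assumption: none of the fields contains a dot
    blgr, gndr, age, ocpn, sign, ext = file_name.split('.')
    if ext != 'xml':
        raise ValueError(f'not an xml file name: {file_name}')
    return {'blgr': blgr, 'gndr': gndr, 'age': int(age), 'ocpn': ocpn, 'sign': sign}

info = parse_file_name('3025353.male.35.Religion.Aquarius.xml')
```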
  
The root element of the xml file is "blog" with child element pairs of "date" and "post" as shown below:

\< blog \>
- \< date \> 30,June,2003 \< /date \>
- \< post \> Anti-war or Civil War?   When I went on  ... It's so hard to make a statement these days. \< /post \>
- \< date \> 24,June,2003 \< /date \>
- \< post \> "People talk about the 'divorce epidemic'  as if  ...  community that shoots its wounded. \< /post \>

\< /blog \>

The raw dataset provides seven features or sources of features including "blogger", "gender", "age", "occupation", "sign", "date" and "post".  
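
To illustrate the "date" / "post" pairing, here is a minimal sketch on a shortened, well-formed stand-in for the example above. It uses the standard library parser only to stay self-contained; the real files need lxml, as noted earlier, and the post text here is abbreviated.

```python
import xml.etree.ElementTree as ET

# shortened, well-formed stand-in for a real blog file
sample = """<blog>
<date>30,June,2003</date>
<post>Anti-war or Civil War?</post>
<date>24,June,2003</date>
<post>People talk about the divorce epidemic.</post>
</blog>"""

root = ET.fromstring(sample)
# children alternate between date and post, so collect each tag separately
dates = [el.text.strip() for el in root if el.tag == 'date']
posts = [el.text.strip() for el in root if el.tag == 'post']
pairs = list(zip(dates, posts))
```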

### Cleaning and Processing Data<a name="clean"></a>

#### First Round of Processing<a name="proc1"></a>
- Extract Features from File Name
- Select Files Based upon Features
- Clean xml Unicode

In [6]:
import pandas as pd
import numpy as np
from pathlib import Path
from string import ascii_uppercase

In [7]:
# files are list of xml files
df = pd.read_csv('blogs/files_A.txt', header=0)
a_xmls = df.xml.tolist()
df = pd.read_csv('blogs/files_B.txt', header=0)
b_xmls = df.xml.tolist()

In [8]:
from lxml import etree 
# collect code points below 100 that cause XMLSyntaxError during parsing
bads = []
for i in range(0, 100):
    try:
        etree.fromstring('<p>%s</p>' % chr(i))
    except etree.XMLSyntaxError:
        bads.append(i)
# keep '&' (38) and '<' (60): they fail in isolation but are part of normal xml markup
bads.remove(38)
bads.remove(60)

In [10]:
# filter out ages <= 21 and unknown occupation; clean xml text prior to parsing
def process_01(file):
    S = file.split(sep='.')
    if any([int(S[2]) <= 21, S[3] == 'indUnk']):
        return None
    else:
        txt = Path(r'blogs/' + file).read_text(errors='replace')
        clean = ''.join([' ' if ord(t) in bads else t for t in txt])
        return {'blgr':S[0], 'gndr':S[1], 'age':int(S[2]), 'ocpn':S[3], 'sign':S[4], 'txt': clean}
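
The character replacement step could also be sketched with `str.translate`. This is an alternative under an explicit assumption, not the method used above: it treats code points below 32 other than tab, line feed and carriage return as the problem characters, which is what XML 1.0 disallows in that range. The names `CONTROL_CODES`, `TO_SPACE` and `blank_control_chars` are hypothetical.

```python
# assumption: the troublesome characters are the control codes that
# XML 1.0 forbids, i.e. everything below 32 except tab, LF and CR
CONTROL_CODES = [c for c in range(32) if c not in (9, 10, 13)]
TO_SPACE = {code: ' ' for code in CONTROL_CODES}

def blank_control_chars(text):
    """Replace each disallowed control character with a single space."""
    return text.translate(TO_SPACE)
```

A translation table avoids building a Python list with one element per character, which may matter for the largest files.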

In [11]:
a1_dcts = [process_01(xml) for xml in a_xmls]
b1_dcts = [x for x in [process_01(xml) for xml in b_xmls] if bool(x)]
# change blogger id to a letter because as an int it causes trouble 
for (i,dct) in enumerate(a1_dcts):
    dct['blgr'] =  ascii_uppercase[i]

[__Top__](#top)

#### Second Round of Processing<a name="proc2"></a>
- Parse xml
- Clean Dates
- Parse Dates
- Tokenize Text

In [35]:
import nltk
from nltk.corpus import stopwords
import spacy
from spacy.lang.en import English
nlp = spacy.load("en_core_web_lg")

In [13]:
def clean_date(date):
    bad_mth = {'Juni':'June', 'Juli':'July', 'juillet':'July'} 
    S = date.split(sep=',')
    if S[1] in bad_mth:
        date = ''.join([S[0], bad_mth[S[1]], S[2]])
    return pd.Timestamp(date)
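
A variant of the date cleaning with an explicit format string, shown only as a sketch. It uses the standard library `datetime` rather than `pd.Timestamp`, assumes the default English month names for `%B`, and reuses the three month corrections from `clean_date`; `parse_blog_date` and `MONTH_FIXES` are hypothetical names.

```python
from datetime import datetime

# same non-English month spellings handled by clean_date
MONTH_FIXES = {'Juni': 'June', 'Juli': 'July', 'juillet': 'July'}

def parse_blog_date(raw):
    """Parse a 'day,Month,year' string such as '30,June,2003'."""
    day, month, year = raw.split(',')
    month = MONTH_FIXES.get(month, month)
    # explicit format, so nothing is left to format inference
    return datetime.strptime(f'{day} {month} {year}', '%d %B %Y')
```

Any month spelling outside `MONTH_FIXES` and the English names raises `ValueError`, which makes unexpected dates visible instead of silently guessed.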

In [111]:
# parse the xml and filter files lacking dates
def process_02(dct, nlp):
    parser = etree.XMLParser(remove_blank_text=True, recover=True)
    root = etree.fromstring(dct['txt'], parser=parser)
    if len(root[0].text) <= 6:
        return None
    else: 
        dct['posts']  = [x.text for x in root if x.tag == 'post']
        dct['dates'] = [clean_date(x.text) for x in root if x.tag == 'date']
    
        return dct   

In [112]:
a2_dcts = [dct for dct in [process_02(dct, nlp) for dct in a1_dcts] if dct is not None]
b2_dcts = [dct for dct in [process_02(dct, nlp) for dct in b1_dcts] if dct is not None]

[__Top__](#top)

#### Third Round of Processing<a name="proc3"></a>
- Create DataFrame of Samples

In [114]:
# create DataFrame of samples, one row per post, with post length in characters
def process_03(dct, name):
    n = len(dct['posts'])
    lnths = [len(doc) for doc in dct['posts']]
    if name == 'A':
        blgrs = np.repeat(dct['blgr'],n)
        df = pd.DataFrame([blgrs, dct['dates'], dct['posts'], lnths], index=['blgr', 'date', 'post', 'lnth']).T    
    else:    
        gndr = np.repeat(dct['gndr'],n)
        age  = np.repeat(dct['age' ],n)
        ocpn = np.repeat(dct['ocpn'],n)
        sign = np.repeat(dct['sign'],n)
        df = pd.DataFrame([ gndr, age, ocpn, sign, dct['dates'], dct['posts'], lnths],
                          index=[ 'gndr', 'age', 'ocpn', 'sign', 'date', 'post', 'lnth']).T 
    #catches blank / mis-parsed posts in xml file
    df = df.drop(list(np.where(df.lnth < 3)[0]), axis=0)
    return df
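
For comparison, a sketch of building the same kind of frame from a dict of columns instead of a transposed list of rows. This is an alternative, not what `process_03` does; the point is that each column then keeps its own dtype (for example, integer ages). The function name `posts_to_frame` and the `min_len` parameter are hypothetical, and only a subset of the columns is shown.

```python
import pandas as pd

def posts_to_frame(dct, min_len=3):
    """One row per post; scalar features are broadcast down the column."""
    posts = dct['posts']
    df = pd.DataFrame({
        'gndr': dct['gndr'],
        'age': dct['age'],
        'date': dct['dates'],
        'post': posts,
        'lnth': [len(p) for p in posts],
    })
    # drop blank / mis-parsed posts, then renumber the rows
    return df[df['lnth'] >= min_len].reset_index(drop=True)

sample = {
    'gndr': 'male',
    'age': 35,
    'dates': [pd.Timestamp('2003-06-30'), pd.Timestamp('2003-06-24')],
    'posts': ['Anti-war or Civil War?', '  '],
}
frame = posts_to_frame(sample)
```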

In [116]:
a3_dfs = [process_03(dct, 'A') for dct in a2_dcts]
b3_dfs = [process_03(dct, 'B') for dct in b2_dcts]

In [123]:
# data frame for blogger "gender", "age", "occupation" and  "sign" features.
gndrs = pd.Series([dct['gndr']  for dct in b1_dcts])
ages  = pd.Series([dct['age']   for dct in b1_dcts])
signs = pd.Series([dct['sign']  for dct in b1_dcts]) 
ocpns = pd.Series([dct['ocpn']  for dct in b1_dcts])
dff = pd.DataFrame([gndrs,ages, signs, ocpns], index=['gndrs', 'ages', 'signs', 'ocpns']).T

[__Top__](#top)

### Create Dataframes and Save to Files<a name="file"></a>

 

In [122]:
# combine sub DataFrames and reindex
dfa = pd.concat(a3_dfs, axis=0)
dfa.index = range(len(dfa))
dfb = pd.concat(b3_dfs, axis=0)
dfb.index = range(len(dfb))

In [124]:
# save to file
dfa.to_csv(r'data/dfa.csv')
dfb.to_csv(r'data/dfb.csv')
dff.to_csv(r'data/dff.csv')
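
A short sketch of the combine, save and reload cycle on toy data, using an in-memory buffer instead of the `data/` folder. The two small frames are made up for illustration. `ignore_index=True` renumbers rows during the concat, and `parse_dates` restores the date column when reading back.

```python
import io
import pandas as pd

# toy stand-ins for the per-blogger frames
parts = [
    pd.DataFrame({'blgr': ['A'], 'date': [pd.Timestamp('2003-06-30')], 'post': ['first post']}),
    pd.DataFrame({'blgr': ['B'], 'date': [pd.Timestamp('2003-06-24')], 'post': ['second post']}),
]
combined = pd.concat(parts, ignore_index=True)

# write without the index column, then read back with dates parsed
buffer = io.StringIO()
combined.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=['date'])
```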

[__Top__](#top)

#### Tests<a name="test"></a>
- always be testing

In [None]:
# reading xml file list tests
# df was last assigned from files_B.txt, so it can only be compared with b_xmls
assert len(a_xmls) == 10
assert len(b_xmls) == len(df)

In [None]:
# process_01 tests
t1 = '4326228.male.17.Student.Cancer.xml'
t2 = '2278942.male.26.Technology.Virgo.xml'
assert process_01(t1) is None
assert process_01(t2)['sign'] == 'Virgo'

In [None]:
# process_02 tests
assert clean_date('02,June,2004') == pd.Timestamp('06-02-2004')
assert clean_date('02,Juli,2004') == pd.Timestamp('07-02-2004')
assert b2_dcts[0].keys() == a2_dcts[0].keys() 

#### Scratch<a name="scratch"></a>
- snippets not used but not ready to delete

In [None]:
# tokenize, lemmatize, use both spaCy and NLTK stops
def lemma_post(doc):
    stops = stopwords.words('english') + ['urllink', 'urlLink']
    A = [d for d in doc if all([d.is_alpha, not d.is_stop])]
    B = [a for a in A if a.lemma_ not in stops]
    return [b.lemma_ if all([b.pos_ == 'PROPN', not b.is_upper]) else b.lemma_.lower() for b in B] 

assert lemma_post(nlp('urlLink www.dmbirc.org')) == []
assert lemma_post(nlp('Tom is FAT')) == ['Tom', 'fat']

In [None]:
# parse the xml and filter files lacking dates
def process_02(dct, nlp):
    parser = etree.XMLParser(remove_blank_text=True, recover=True)
    root = etree.fromstring(dct['txt'], parser=parser)
    if len(root[0].text) <= 6:
        return None
    else: 
        docs = [nlp(x.text) for x in root if x.tag == 'post']
        dct['vocab']  = [doc.vocab for doc in docs]
        dct['sents']  = [doc.sents for doc in docs]
        dct['lemmas'] = [lemma_post(doc) for doc in docs]
        #dct['feels']  = [doc.sentiment for doc in docs]
        dct['lngths'] = [len(x.text) for x in root if x.tag == 'post']
        dct['dates'] = [clean_date(x.text) for x in root if x.tag == 'date']
    
        return dct     