# Lab - Object Oriented Programming

In [1]:
import pandas as pd
import numpy as np

# Challenge 2

In order to understand the benefits of simple object-oriented programming, we have to build up our classes from the beginning. 

You'll use the following dataframe generator to test some things. Try to understand what the following function does.

In [2]:
chars = ['a', 'b', 'c','d', 'e', 'f', ' ', 'á','é','ó']

def create_weird_dataframe(size=10):
    def create_weird_colnames(size=size):
        probs = [.2,.2,.15,.1,.1,.1,.05,.05,.025,.025]

        return [''.join(
            [(char.upper() if np.random.random() < 0.2 else char) 
                     for char in np.random.choice(chars,size=12, p=probs)]) for i in range(size)]
    
    data = np.random.random(size=(size,size))
    colnames = create_weird_colnames(size)
    return pd.DataFrame(data=data, columns=colnames)
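
One way to read the generator: each column name is 12 characters drawn from `chars` with unequal weights, and each character is upper-cased with probability 0.2. The sketch below reproduces only that step for a single name. It assumes NumPy's `default_rng` with a fixed seed so the result is repeatable; `make_name` and `rng` are illustrative names that do not appear in the lab.

```python
import numpy as np

chars = ['a', 'b', 'c', 'd', 'e', 'f', ' ', 'á', 'é', 'ó']
probs = [.2, .2, .15, .1, .1, .1, .05, .05, .025, .025]  # one weight per character

def make_name(rng, length=12):
    # Draw `length` characters, favouring the ones with larger weights
    drawn = rng.choice(chars, size=length, p=probs)
    # Upper-case each character with probability 0.2
    return ''.join(c.upper() if rng.random() < 0.2 else c for c in drawn)

rng = np.random.default_rng(0)  # seeded generator (an assumption, for repeatability)
name = make_name(rng)
print(repr(name), len(name))
```

Because spaces are among the possible characters, a name can start with, end with, or contain runs of spaces, which is what makes these column names awkward to work with.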

Test the results of running that function below. Run it several times.

In [3]:
df = create_weird_dataframe()
df.head()

Unnamed: 0,dbAabFÁÁfóee,befbadcDóba,dbeceecfbaba,ÁcfadcafCÁbA,acaáÁbcdffa,ddAdFccAcfda,b óáádfbábBe,ábAAeaBc fAa,EaafcfbÉCaaB,eadaabae db
0,0.832937,0.769406,0.30176,0.952458,0.502216,0.355087,0.31084,0.798624,0.804225,0.241552
1,0.715352,0.232911,0.255348,0.302425,0.700295,0.27062,0.802789,0.571201,0.373515,0.301756
2,0.647627,0.916805,0.372218,0.684945,0.206967,0.370761,0.547854,0.189569,0.707718,0.512081
3,0.51416,0.386404,0.181874,0.765299,0.706728,0.376028,0.491603,0.067649,0.629088,0.94362
4,0.837216,0.17544,0.920006,0.484481,0.836141,0.114527,4.9e-05,0.094516,0.921951,0.149547


In [4]:
df.columns

Index(['dbAabFÁÁfóee', 'befbadcDóba ', 'dbeceecfbaba', 'ÁcfadcafCÁbA',
       'acaáÁbcdffa ', 'ddAdFccAcfda', 'b óáádfbábBe', 'ábAAeaBc fAa',
       'EaafcfbÉCaaB', ' eadaabae db'],
      dtype='object')

## Correcting the column names

We'll create a function that renames the weird column names. The plan is to later extend this idea to our own brand-new dataframe class.

### Let's start simple: get the column names of the dataframe.

Store them in a variable called `col_names`.


In [5]:
# `create_weird_colnames` only exists inside `create_weird_dataframe`,
# so read the names from the dataframe itself.
col_names = df.columns

### Let's iterate through these columns and transform them into lower-case column names

Create a list comprehension to do that if possible. Store it in a variable called `lower_colnames`

In [6]:
lower_colnames = [col.lower() for col in df.columns]

lower_colnames

['dbaabfááfóee',
 'befbadcdóba ',
 'dbeceecfbaba',
 'ácfadcafcába',
 'acaáábcdffa ',
 'ddadfccacfda',
 'b óáádfbábbe',
 'ábaaeabc faa',
 'eaafcfbécaab',
 ' eadaabae db']

### Let's remove the spaces of these column names!

Replace each space ` ` in a column name with an underscore `_`. Again, try to use a list comprehension.
For this first task, use the `.replace(' ','_')` method.

In [7]:
normalize_cols = [colnames.replace(' ','_') for colnames in lower_colnames]
normalize_cols

['dbaabfááfóee',
 'befbadcdóba_',
 'dbeceecfbaba',
 'ácfadcafcába',
 'acaáábcdffa_',
 'ddadfccacfda',
 'b_óáádfbábbe',
 'ábaaeabc_faa',
 'eaafcfbécaab',
 '_eadaabae_db']

### Create a function that combines the steps above and returns the lower-case, underscored names as a list

Name the function `normalize_cols`. It should receive a dataframe, get its column names, and return the treated list of column names.

In [8]:
def normalize_cols(dataframe):
    # Lower-case each column name and replace every space with an underscore
    treated_column = [col.lower().replace(' ', '_') for col in dataframe.columns]

    return treated_column



### Test your results

Use the following line of code to test your results. Run it several times to see some behaviors.

In [9]:
normalize_cols(create_weird_dataframe())

['cáácbafcfb_f',
 'ca_bfbbebcae',
 'cdefáacáaába',
 'bcbaacbabab_',
 'eeáe_addbccá',
 'caáad_acafcf',
 'cbdeaacacaeb',
 'ébfcéaabebbb',
 'ebaebaccbbad',
 'bafbbbca_fbe']

### Hmm, we've made a mistake!

We've made several mistakes here. Have you noticed any bugs in the results?

In order for us to see some problems in our results, we have to look for edge cases. 

For example: 

**Problem #1:** what if there are two or more consecutive spaces? Do we want one underscore per space, or should they be condensed into a single underscore?

**Problem #2:** what if there are spaces at the beginning or end? Should we replace them with underscores or drop them?

Let's correct each problem, starting with Problem 2.
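
Both edge cases are easy to reproduce on hand-written strings. A small sketch, using made-up column names rather than generated ones:

```python
# Problem 1: consecutive spaces each become their own underscore
double_space = 'total  sales'          # two spaces between the words
print(double_space.replace(' ', '_'))  # total__sales

# Problem 2: leading and trailing spaces turn into stray underscores
padded = ' price '
print(padded.replace(' ', '_'))        # _price_
```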

## Correcting our function

Instead of replacing the spaces right away, let's first remove the leading and trailing spaces!

Recreate the `normalize_cols` with the solution to `Problem 2`.

*Hint: Copy and paste the last `normalize_cols` function to change it.*

In [10]:
def normalize_cols(dataframe):
    treated_column = [colnames.lower().strip() for colnames in (list(dataframe.columns))]

    
    return treated_column
normalize_cols(create_weird_dataframe())

['debdddóddfbf',
 'efaeaab cceá',
 'daeb adfcbbá',
 'feáaaaóbéfac',
 'bbacbbcdccad',
 'cabfdbaeccác',
 'e c eáeádac',
 'bccbbcfcacbd',
 'edf aadé óbb',
 'cf ábbeac é']

### Test your results again.

For now, at least, you should not see any leading or trailing underscores.

In [11]:
normalize_cols(create_weird_dataframe())

['défabbfccbdc',
 'cdaadcaab cb',
 'faacfóáfdbáa',
 'bababcdedacc',
 'báaaaaféófbc',
 'ebdaáóóbbfea',
 'bbfbócf faad',
 'ddbóaecbeaae',
 'cbbdd a cc ó',
 'bfadfácéábff']

### Correcting problem 1

To correct Problem 1, use a regular expression instead of the `.replace()` string method. Use the `re` module to replace every run of one or more whitespace characters with a single underscore `_`.

Test your solution on the variable below:

In [12]:
import re 

text = 'these spaces      should all be one underline'
re.sub(r'\s+', '_', text)  # raw string, so `\s` reaches the regex engine unchanged

'these_spaces_should_all_be_one_underline'
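
The order of the two steps matters, and `\s` matches more than plain spaces. The sketch below illustrates both points on a made-up name that has padding and a trailing tab:

```python
import re

name = '  unit   price\t'

# Strip first, then collapse: no underscores at the edges
strip_then_sub = re.sub(r'\s+', '_', name.strip())
print(strip_then_sub)     # unit_price

# Collapse first, then strip: strip() only removes whitespace, so the
# underscores created at the edges survive
sub_then_strip = re.sub(r'\s+', '_', name).strip()
print(sub_then_strip)     # _unit_price_

# A literal-space pattern leaves other whitespace (here a tab) untouched
spaces_only = re.sub(r' +', '_', name)
print(repr(spaces_only))  # '_unit_price\t'
```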

### Now correct your `normalize_cols` function

*Hint: Copy and paste the last `normalize_cols` function to change it.*

In [13]:
def normalize_cols(dataframe):
    import re
    # Lower-case, strip the edges, then collapse each whitespace run into one underscore
    treated_column = [re.sub(r'\s+', '_', col.lower().strip()) for col in dataframe.columns]

    return treated_column

### Again, test your results.

Now some column names may be shorter than before, because leading and trailing spaces are dropped and consecutive spaces collapse into one underscore.

In [14]:
normalize_cols(create_weird_dataframe())

['bcéeadabéfab',
 'cáabbb_fábbf',
 'dfcdfacá_abb',
 'bábcéca_ae_ó',
 'bbcáfdadaebc',
 'bbcbeccffbb',
 'aecabdacfecó',
 'abcfca_cabaó',
 'bbe_bbáafcae',
 'dbfaf_áaabba']

## Last step: remove accents

The last step consists of removing accents from the strings.

Import the package `unidecode` and use its function, also called `unidecode`, to remove accents. Test it on the word below.

In [15]:
pip install unidecode

Note: you may need to restart the kernel to use updated packages.


In [16]:
from unidecode import unidecode

In [17]:
text = 'aéóúaorowó'

In [21]:
x = unidecode(text)
x

'aeouaorowo'
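
If installing `unidecode` is not an option, accents on Latin letters can also be removed with the standard library. This is a different technique from `unidecode` (Unicode decomposition rather than transliteration), and it only handles characters that decompose into a base letter plus combining marks. `strip_accents` is a hypothetical helper name, not part of the lab.

```python
import unicodedata

def strip_accents(text):
    # NFKD splits 'é' into 'e' followed by a combining acute accent
    decomposed = unicodedata.normalize('NFKD', text)
    # Keep everything except the combining marks
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents('aéóúaorowó'))  # aeouaorowo
```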

### Now remove the accents from each column name in your `normalize_cols` function.

*Hint: Copy and paste the last `normalize_cols` function to change it.*

In [24]:
def normalize_cols(dataframe):
    import re
    # Lower-case, strip, collapse whitespace into underscores, then remove accents
    treated_column = [unidecode(re.sub(r'\s+', '_', col.lower().strip())) for col in dataframe.columns]

    return treated_column

### Test your results

In [25]:
normalize_cols(create_weird_dataframe())

['aeffobceebcc',
 'aec_beceecd',
 'a_fafaadfab',
 'baaeofbaffec',
 'aeb_eddcb_a',
 'ccabe_cecbab',
 'oaedabefao',
 'caadacafaaeb',
 'cbadfefadcaa',
 'bfodbcbabcdb']

## Good job. 

You now have a function that receives a dataframe and returns its column names in a clean format.

# Creating our own dataframe.

In [26]:
from pandas import DataFrame

A dataframe is just a simple class. It contains its own attributes and methods. 

When you call `pd.DataFrame()`, you are instantiating the DataFrame class as an object that you can store in a variable. From then on, you have access to all DataFrame attributes (`.columns`, for example) and methods (`.isna()`, for example). We've been using those all along!

If we wish, we could create our own class inheriting everything from a DataFrame class.

In [27]:
class myDataFrame(DataFrame):
    pass
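
Even with an empty body, the subclass already behaves like a DataFrame. A quick check, using a tiny hand-made frame (the data is made up for illustration):

```python
from pandas import DataFrame

class myDataFrame(DataFrame):
    pass

df = myDataFrame({'Col A': [1, 2], 'Col B': [3, 4]})

print(isinstance(df, DataFrame))  # True: a myDataFrame is also a DataFrame
print(list(df.columns))           # ['Col A', 'Col B'], an inherited attribute
print(df.isna().sum().sum())      # 0, computed with an inherited method
```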

Instead of just creating myDataFrame, put your function inside your new inherited class, that is, transform `normalize_cols` into a method of your own DataFrame.

Remember that `self` must be the first argument of `normalize_cols`. You can then replace every occurrence of `dataframe` inside the method with `self`.

At the end, return the list of the correct names.

In [32]:
class myDataFrame(DataFrame):

    def normalize_cols(self):
        import re
        # Same steps as the function version, reading the names from `self`
        treated_column = [unidecode(re.sub(r'\s+', '_', col.lower().strip())) for col in self.columns]

        return treated_column

Test your results.

In [33]:
df = myDataFrame(create_weird_dataframe())
df.normalize_cols()

['dccaebcaface',
 'ccfcbedacfba',
 'ea_befecfeea',
 'ebbeebcacccd',
 'bbdadbfaadab',
 'ffbeabebaaof',
 'becbbeefbcbc',
 'addcdcoddaba',
 'efeba_debaaf',
 'bcbcacadeaab']

## Understanding even more the `self` argument

Instead of returning a list with the corrected names, now assign that list to `self.columns`. This replaces the column names of the object itself.


Now change your method to return the dataframe itself. That is, return the `self` argument this time and see the results! 

```python
class myDataFrame(DataFrame):
    def normalize_cols(self):
        ...
        return self
```

In [53]:
class myDataFrame(DataFrame):
    def normalize_cols(self):
        import re
        treated_column = [unidecode(re.sub(r'\s+', '_', col.lower().strip())) for col in self.columns]

        # Overwrite the column names on the object itself, then hand the object back
        self.columns = treated_column
        return self

In [54]:
df = myDataFrame(create_weird_dataframe())
df

Unnamed: 0,bcfbcbbeAdEd,abafafcbBaóc,AAcabb fccFa,ebadafébfcfe,c DBAbb BBdf,éFéBCcaedCEC,babbaaBcbdc,aB baefbfcfD,Féáó cÉábaBc,aaAféFfcc Aa
0,0.297757,0.570887,0.311847,0.749167,0.533306,0.018265,0.925528,0.093832,0.561108,0.969441
1,0.808092,0.578903,0.428504,0.153232,0.264628,0.595962,0.193105,0.351042,0.220575,0.715059
2,0.457857,0.231419,0.168292,0.560404,0.53719,0.564746,0.259711,0.744273,0.37099,0.006338
3,0.53928,0.505421,0.10546,0.506075,0.23106,0.280079,0.442383,0.231805,0.457478,0.146872
4,0.839326,0.999988,0.995065,0.292315,0.094592,0.091712,0.504636,0.257063,0.91664,0.42342
5,0.023922,0.005191,0.06117,0.069762,0.647027,0.621679,0.29178,0.91661,0.90728,0.577293
6,0.799428,0.142639,0.275642,0.873196,0.541442,0.821746,0.272702,0.646433,0.940519,0.104644
7,0.9784,0.476219,0.529806,0.093248,0.878261,0.129385,0.023349,0.788104,0.420515,0.656825
8,0.689449,0.442449,0.554378,0.722603,0.705576,0.162034,0.634017,0.280352,0.769626,0.123277
9,0.626104,0.897269,0.760253,0.977981,0.251798,0.760812,0.849754,0.773092,0.407095,0.770327


In [55]:
df.normalize_cols()

Unnamed: 0,bcfbcbbeaded,abafafcbbaoc,aacabb_fccfa,ebadafebfcfe,c_dbabb_bbdf,efebccaedcec,babbaabcbdc,ab_baefbfcfd,feao_ceababc,aaafeffcc_aa
0,0.297757,0.570887,0.311847,0.749167,0.533306,0.018265,0.925528,0.093832,0.561108,0.969441
1,0.808092,0.578903,0.428504,0.153232,0.264628,0.595962,0.193105,0.351042,0.220575,0.715059
2,0.457857,0.231419,0.168292,0.560404,0.53719,0.564746,0.259711,0.744273,0.37099,0.006338
3,0.53928,0.505421,0.10546,0.506075,0.23106,0.280079,0.442383,0.231805,0.457478,0.146872
4,0.839326,0.999988,0.995065,0.292315,0.094592,0.091712,0.504636,0.257063,0.91664,0.42342
5,0.023922,0.005191,0.06117,0.069762,0.647027,0.621679,0.29178,0.91661,0.90728,0.577293
6,0.799428,0.142639,0.275642,0.873196,0.541442,0.821746,0.272702,0.646433,0.940519,0.104644
7,0.9784,0.476219,0.529806,0.093248,0.878261,0.129385,0.023349,0.788104,0.420515,0.656825
8,0.689449,0.442449,0.554378,0.722603,0.705576,0.162034,0.634017,0.280352,0.769626,0.123277
9,0.626104,0.897269,0.760253,0.977981,0.251798,0.760812,0.849754,0.773092,0.407095,0.770327


In [56]:
df

Unnamed: 0,bcfbcbbeaded,abafafcbbaoc,aacabb_fccfa,ebadafebfcfe,c_dbabb_bbdf,efebccaedcec,babbaabcbdc,ab_baefbfcfd,feao_ceababc,aaafeffcc_aa
0,0.297757,0.570887,0.311847,0.749167,0.533306,0.018265,0.925528,0.093832,0.561108,0.969441
1,0.808092,0.578903,0.428504,0.153232,0.264628,0.595962,0.193105,0.351042,0.220575,0.715059
2,0.457857,0.231419,0.168292,0.560404,0.53719,0.564746,0.259711,0.744273,0.37099,0.006338
3,0.53928,0.505421,0.10546,0.506075,0.23106,0.280079,0.442383,0.231805,0.457478,0.146872
4,0.839326,0.999988,0.995065,0.292315,0.094592,0.091712,0.504636,0.257063,0.91664,0.42342
5,0.023922,0.005191,0.06117,0.069762,0.647027,0.621679,0.29178,0.91661,0.90728,0.577293
6,0.799428,0.142639,0.275642,0.873196,0.541442,0.821746,0.272702,0.646433,0.940519,0.104644
7,0.9784,0.476219,0.529806,0.093248,0.878261,0.129385,0.023349,0.788104,0.420515,0.656825
8,0.689449,0.442449,0.554378,0.722603,0.705576,0.162034,0.634017,0.280352,0.769626,0.123277
9,0.626104,0.897269,0.760253,0.977981,0.251798,0.760812,0.849754,0.773092,0.407095,0.770327
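
Two consequences of this design are worth checking. Because the method assigns to `self.columns` and returns `self`, the returned object is the same object as `df`, which is why `df` shows the new names afterwards. Also, a plain subclass falls back to `DataFrame` for the results of most operations; the pandas documentation on subclassing describes overriding the `_constructor` property to keep the subclass type. The sketch below assumes that pattern. `NormalizedFrame` is a hypothetical name, and accent removal is left out so the example needs nothing beyond pandas.

```python
import re
from pandas import DataFrame

class NormalizedFrame(DataFrame):
    @property
    def _constructor(self):
        # Ask pandas to build results (e.g. from .head()) as NormalizedFrame
        return NormalizedFrame

    def normalize_cols(self):
        # Lower-case, strip, and collapse whitespace runs into one underscore
        self.columns = [re.sub(r'\s+', '_', col.lower().strip())
                        for col in self.columns]
        return self

df = NormalizedFrame({' Col  A': [1, 2, 3], 'col B ': [4, 5, 6]})
result = df.normalize_cols()

print(result is df)               # True: the frame was modified in place
print(list(df.columns))           # ['col_a', 'col_b']
print(type(df.head(2)).__name__)  # type of a derived frame
```

Returning `self` also allows chaining, for example `df.normalize_cols().head()`, at the cost of mutating the original object rather than producing a copy.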
