
Data Wrangling & Tidying
==========================================

The data we receive often needs to be cleaned, transformed, and restructured before it can provide any insights. This process is often called "data wrangling" or "data munging." A tidy dataset follows three fundamental rules:

* Each variable forms a column, with all of its values in the same unit of measurement
* Each observation forms a row
* Each type of **observational unit** forms a table

An **observational unit** is the individual object or instance that we capture information about. For example, in a study about trees, the observational unit would be each tree.

A dataset is considered to be in wide-form when at least one variable is represented across multiple columns as column headers rather than in a single column. 


Melt method
-------------------

Pandas' **.melt() method** helps convert a dataset from wide to long-form. 

Syntax:

    ::

       pd.melt(dataset, id_vars = , var_name = , value_name = )

The parameters for this function are:

* id_vars: name of the column(s) holding the identifier variable(s). If there is more than one identifier variable, pass a list of column names.
* var_name: name for the new single column containing the column names that are being combined
* value_name: name for the new single column of values.

Example:

    ::

        data = pd.read_csv("players.csv")
        data_tidy = pd.melt(data, id_vars=["Day", "Player"], var_name="Game", value_name="Score")
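
As a rough illustration of the call above, the sketch below builds a small wide-form table in memory instead of reading players.csv. The column names Game1 and Game2 and all of the values are invented for the example.

    ::

        import pandas as pd

        # Hypothetical wide-form data: one column per game (values are invented).
        data = pd.DataFrame({
            "Day": ["Mon", "Mon"],
            "Player": ["Ana", "Ben"],
            "Game1": [10, 7],
            "Game2": [12, 9],
        })

        # Gather the game columns into a single "Game" column and their
        # values into a single "Score" column.
        data_tidy = pd.melt(
            data,
            id_vars=["Day", "Player"],
            var_name="Game",
            value_name="Score",
        )

        print(data_tidy)

Each player now has one row per game, so the two wide rows become four long rows.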


Pivot method
------------------

A dataset is considered “too long” when a single column in the dataset represents more than one variable, thus creating unnecessary extra rows. Another way to think about it is that all numbers in a single column should have the same unit. Pandas' pivot method is useful here.

The pivot method allows us to reshape a dataset based on the values of a column. 

Syntax:

    ::

          dataset.pivot(index = , columns = , values = ).reset_index()

The parameters are:

* index: name of the column to make the new data frame's index 
* columns: name of the column to make the new DF's column headers 
* values: name of the column that will populate the new data frame's values 

Chaining .reset_index() and then setting .columns.name = None on the result clears up some of the indexing that was carried over from the original form of the dataset.

Example:

    ::

        data = pd.read_csv("countries.csv")
        data_tidy = data.pivot(index = 'Country', columns = 'Feature', values = 'Observation').reset_index()
        data_tidy.columns.name = None
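
The same reshaping can be tried without countries.csv. This is a minimal sketch under assumed data: the country labels, feature names and numbers are all hypothetical.

    ::

        import pandas as pd

        # Hypothetical long-form data: the "Feature" column mixes two variables.
        data = pd.DataFrame({
            "Country": ["A", "A", "B", "B"],
            "Feature": ["Area", "Population", "Area", "Population"],
            "Observation": [100, 5000, 250, 9000],
        })

        # Spread the values of "Feature" into their own columns.
        data_tidy = data.pivot(
            index="Country", columns="Feature", values="Observation"
        ).reset_index()

        # Drop the leftover "Feature" label on the column index.
        data_tidy.columns.name = None

        print(data_tidy)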



Using Pandas to Clean Data
==========================================

https://www.codecademy.com/paths/data-analyst/tracks/dacp-data-wrangling-and-tidying/modules/dscp-fundamentals-of-data-wrangling-and-tidying/articles/intro-data-wrangling-and-tidying

The power of pandas is mainly in being able to manipulate large amounts of structured data. The first step of diagnosing whether or not a dataset is tidy is using pandas functions to explore and probe the dataset. Some of the most useful functions are:

To get an overall sense of the data:

* .head() - displays the first 5 rows of the table
* .info() - displays a summary of the table (column names, non-null counts and data types)
* .describe() - displays summary statistics for the table
* .columns - displays column names of the table
* .value_counts() - displays each distinct value in a column along with the number of times it appears
* .shape - an attribute (no parentheses) that gives the number of rows and columns in our dataset as (rows, columns)
* .dtypes - shows the data type of each column in our dataframe. Common types: float64, object, int64, bool, and datetime64.
  
  ::
      
    print(restaurants.dtypes)

* .nunique() - looks at the number of unique values in each column
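
The exploration tools listed above might be used together as in the following sketch. The restaurants table here is a made-up stand-in with invented values, not the real dataset.

    ::

        import pandas as pd

        # Hypothetical stand-in for the restaurants table (values are invented).
        restaurants = pd.DataFrame({
            "name": ["Cafe A", "Cafe B", "Diner C", "Diner D"],
            "cuisine": ["Thai", "Thai", "American", "Mexican"],
            "rating": [4.5, 3.0, 4.0, 5.0],
        })

        print(restaurants.head())      # first rows (up to 5)
        print(restaurants.shape)       # attribute, not a method
        print(restaurants.dtypes)      # one data type per column
        print(restaurants.nunique())   # number of distinct values per column
        print(restaurants["cuisine"].value_counts())  # rows per cuisine
        restaurants.info()             # prints its summary directly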


Cleaning up:

* .rename() - relabels columns using a dictionary that maps old names to new names. axis=1 refers to the columns; axis=0 would refer to the rows.
  
  ::
      
      restaurants = restaurants.rename({'dba': 'name', 'cuisine description': 'cuisine'}, axis=1)

* .drop_duplicates() - removes duplicate rows
* .str.lstrip() - removes leading characters, such as a prefix, from the strings in a column
* isna() returns a boolean, which indicates if the observation in that column is missing (True) or not (False)
* map() function can be used with str.lower to make column names consistent: restaurants.columns = map(str.lower, restaurants.columns)
* crosstab() computes the frequency of two or more variables. We can use it with isna() to identify if there is a NaN in that column.
* melt() function - reshapes wide-form data into long form (see the Melt method section above)
* str.lower - lowercases the strings in a column when passed to .apply()

    ::

        df['shoe_type'] = df.shoe_type.apply(str.lower)
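
The cleaning steps above could be chained roughly as follows. This is only a sketch: the table, its column names and the id\_ prefix are hypothetical.

    ::

        import pandas as pd

        # Hypothetical raw table with messy headers, a duplicate row and a gap.
        df = pd.DataFrame({
            "Shoe_Type": ["SANDALS", "BOOTS", "BOOTS", None],
            "Item ID": ["id_001", "id_002", "id_002", "id_003"],
        })

        # Lowercase every header, then give one column a friendlier name.
        df.columns = list(map(str.lower, df.columns))
        df = df.rename({"item id": "item_id"}, axis=1)

        # Remove exact duplicate rows.
        df = df.drop_duplicates()

        # The .str methods skip missing values, so the gap stays missing.
        df["shoe_type"] = df["shoe_type"].str.lower()

        # lstrip removes any leading characters from the given set
        # (here "i", "d" and "_"), not a literal prefix.
        df["item_id"] = df["item_id"].str.lstrip("id_")

        # Count present (False) vs. missing (True) shoe types per item.
        missing_by_item = pd.crosstab(df["item_id"], df["shoe_type"].isna())

        print(df)
        print(missing_by_item)

Using .str.lower() instead of .apply(str.lower) is a deliberate choice here, because str.lower raises an error when it meets a missing value.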


Dealing with Multiple Files
-------------------------------

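We can combine **glob**, a module in Python's standard library for finding files, with pandas to better organize data that is separated into multiple files. Glob does not open files itself. It returns the filenames that match a shell-style wildcard pattern, which is simpler than a full regular expression, and we can then read each of those files.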

The below code goes through any file that starts with 'file' and has an extension of .csv. It opens each file, reads the data into a DataFrame, and then concatenates all of those DataFrames together.

    ::

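        import glob

        import pandas as pd

        # Collect every filename that matches the wildcard pattern.
        files = glob.glob("file*.csv")

        # Read each file into its own DataFrame.
        df_list = []
        for filename in files:
            data = pd.read_csv(filename)
            df_list.append(data)

        # Stack all of the DataFrames into one.
        df = pd.concat(df_list)

        print(files)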
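
To try the pattern without existing files, the sketch below first writes two small CSV files to a temporary folder. The file names and contents are invented, and sorting the filenames is an added assumption to make the row order predictable.

    ::

        import glob
        import os
        import tempfile

        import pandas as pd

        # Write two small CSV files into a temporary folder so the example
        # runs anywhere; the file names and contents are invented.
        folder = tempfile.mkdtemp()
        for i in (1, 2):
            pd.DataFrame({"id": [i], "score": [i * 10]}).to_csv(
                os.path.join(folder, f"file{i}.csv"), index=False
            )

        # "*" is a shell-style wildcard, not a regular expression.
        # sorted() fixes the order, since glob does not guarantee one.
        files = sorted(glob.glob(os.path.join(folder, "file*.csv")))

        # ignore_index=True renumbers the rows of the combined table.
        df = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)

        print(df)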


Regular Expressions
---------------------------

**Regular expressions** are special sequences of characters that describe a pattern of text that needs to be found. Regular expressions operate by moving through a piece of text, character by character, from left to right. When a character is found that matches the first piece of the expression, it looks to find a continuous sequence of matching characters.

* **Literals** are regular expressions that contain the exact text we want to match.
* **Alternation** allows us to match either the text preceding or the text following the pipe symbol \|.

    ::
        
        apples|pears

* **Character sets** are denoted by brackets [ ] and let us match one character from a series of characters. The letters inside are the different possibilities for the character that appears in that position. We can also place the caret symbol **^** in front to negate our character sets, matching any character *not* listed in brackets.

    ::
        
        [cat] 
        # will match characters c, a, or t, but not the text cat 
        
        [^cat] 
        # will match any character that is *not* c, a, or t

  
* **Grouping** or **capture groups** lets us group parts of a regular expression together and limit the reach of the \| to the text within ():
  
    ::

        I love (baboons|gorillas)
        # will match "I love" and then either "baboons" or "gorillas"



* **Wildcards** are represented by a period or dot . and will match any single character (letter, number, symbol or whitespace). We can use the escape character \\ when we want to match an actual period.

    ::

        .........
        # will completely match any 9-character piece of text! 

        Howler monkeys are really lazy\.
        # will completely match the text "Howler monkeys are really lazy."

* **Ranges** allow us to specify a range of characters in which we can make a match, without having to type out each one. The - indicates that we are matching a range. 

    ::
        
       [a-d]
       # would match any one character a, b, c, or d
       # the above is equivalent to [abcd]


A range can be used to match any single:

* capital letter [A-Z]
* lowercase letter [a-z]
* digit [0-9]

We can also match multiple ranges in the same set.

    ::
        
        [A-Za-z]
        # would let us match any single capital or lowercase alphabetical character
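
The building blocks covered so far can be tried out with Python's built-in re module. The following is a minimal sketch; the sample strings are made up for illustration.

    ::

        import re

        # Alternation: either side of the pipe can match.
        fruit = re.findall(r"apples|pears", "apples and pears")

        # Character set: one character out of c, a or t.
        in_set = re.findall(r"[cat]", "attic")

        # Negated set: any character that is not c, a or t.
        not_in_set = re.findall(r"[^cat]", "attic")

        # Grouping limits the alternation to the words in parentheses.
        ape = re.fullmatch(r"I love (baboons|gorillas)", "I love gorillas")

        # Wildcard: nine dots match any nine characters.
        any_nine = re.fullmatch(r".........", "123-45 ab")

        # Escaped dot: only a real period matches, so "lazy!" does not.
        real_dot = re.fullmatch(r"lazy\.", "lazy!")

        # Range: one lowercase letter from a to d.
        a_to_d = re.findall(r"[a-d]", "abcdef")

        print(fruit, in_set, not_in_set, ape.group(1), a_to_d)

re.fullmatch requires the whole string to match, while re.findall returns every match found while scanning from left to right.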

* **Shorthand character classes** make writing regular expressions much simpler by representing common ranges. Examples include: \\w word characters, \\d digit characters, \\s whitespace characters
* **Groupings**, denoted with parentheses (), group parts of a regular expression together and allow us to limit alternation to part of a regex
* **Fixed quantifiers**, represented with curly braces {}, let us indicate the exact quantity or a range of quantities of a character we wish to match. Note: quantifiers are greedy. They will match the **greatest quantity** of characters they possibly can and ignore smaller matches.
* **Optional quantifiers**, indicated by the question mark ?, allow us to indicate a character in a regex is optional, meaning it can appear either 0 or 1 times. They only apply to the character directly before the *?* (Note: since ? is a metacharacter, we need to use the escape character \\ in our regex in order to match a question mark ? in a piece of text.)
* **The Kleene star** is also a quantifier and matches the preceding character 0 or more times, meaning the character doesn't need to appear, can appear once, or can appear many times. It is denoted with the asterisk \*.
  
    ::

        meo*w 
        # will match "me", followed by 0 or more "o", followed by "w". 
        # Thus the regex will match "mew", "meow", "meooow", and "meoooooooooooow".

* **The Kleene plus**, denoted by the plus +, matches the preceding character 1 or more times.

    ::

        meo+w 
        # will match the characters "me", followed by 1 or more "o", followed by a "w". 
        # Thus the regex will match "meow", "meooow", and "meoooooooooooow", but not match mew.

The fixed and optional quantifiers described above can be illustrated as follows:

    ::

        \w{3} 
        # will match exactly 3 word characters

        \w{4,7} 
        # will match at minimum 4 and at maximum 7 word characters

        roa{3}r 
        # will match the characters "ro" followed by 3 "a", and then the character "r"

        mo{2,4} 
        # quantifiers are greedy - they will match the text "moooo" in the string "moooo", but not return a match of "moo" or "mooo"

    ::
        
        humou?r 
        # matches the characters "humo" then either 0 or 1 occurrence of "u" and finally "r" .

* **Anchors** ensure that we do not match unintended text by making the expression as specific as possible. The anchors hat ^ and dollar sign $ are used to match text at the start and the end of a string, respectively. 

    ::

        ^Monkeys: my mortal enemy$ 
        # ^ ensures that the matched text begins with "Monkeys", the $ ensures the matched text ends with "enemy".
        # will completely match "Monkeys: my mortal enemy" 
        # will not match "Spider Monkeys: my mortal enemy in the wild" or "Squirrel Monkeys: my mortal enemy in the wild" 
        # Without the anchor tags, the regex Monkeys: my mortal enemy will match the text Monkeys: my mortal enemy in both Spider Monkeys: my mortal enemy in the wild and Squirrel Monkeys: my mortal enemy in the wild.
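
The quantifiers and anchors above can be checked with Python's built-in re module. This is a small sketch that reuses the patterns from this section on sample strings.

    ::

        import re

        words = ["mew", "meow", "meooow"]

        # Kleene star (0 or more "o") vs. Kleene plus (1 or more "o").
        star = [w for w in words if re.fullmatch(r"meo*w", w)]
        plus = [w for w in words if re.fullmatch(r"meo+w", w)]

        # Fixed quantifier with a range; greedy, so it takes all four "o".
        greedy = re.search(r"mo{2,4}", "moooo").group()

        # Optional quantifier: the "u" may appear 0 or 1 times.
        spellings = re.findall(r"humou?r", "humor and humour")

        # Anchors: the string must start and end exactly as written.
        pattern = r"^Monkeys: my mortal enemy$"
        exact = re.search(pattern, "Monkeys: my mortal enemy")
        longer = re.search(pattern, "Spider Monkeys: my mortal enemy in the wild")

        print(star, plus, greedy, spellings, bool(exact), bool(longer))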




Splitting
---------------------------------

Sometimes we receive data with columns that contain more than one type of data. We can use splitting to correct this. 

We can split by index when at least one of the parts is a set number of characters.

We can also split by character - good when the part to be split may vary in length but is always separated by a specific character.

This might be used to extract numbers contained within a string using pandas' **.str.split()** function with **regex**

  :: 
  
    students['grade'].str.split(r'(\d+)', expand=True)[1]
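
The three kinds of splitting described above might look like the sketch below. The students table, its column names and its values are hypothetical.

    ::

        import pandas as pd

        # Hypothetical columns that each pack two values into one string.
        students = pd.DataFrame({
            "birthday": ["03152008", "11022009"],   # MMDDYYYY, fixed width
            "full_name": ["Ana Diaz", "Ben Lee"],   # separated by a space
            "grade": ["9th grade", "10th grade"],   # number inside text
        })

        # Split by index: the month is always the first two characters.
        students["month"] = students["birthday"].str[0:2]

        # Split by character: break on the space and keep both parts.
        name_parts = students["full_name"].str.split(" ", expand=True)
        students["first_name"] = name_parts[0]
        students["last_name"] = name_parts[1]

        # Split on a captured run of digits; column 1 holds the digits.
        grade_parts = students["grade"].str.split(r"(\d+)", expand=True)
        students["grade_num"] = grade_parts[1]

        print(students)

Because the digits are wrapped in a capture group, the split keeps them as their own piece instead of discarding them.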

Data Types
----------------------------------
Each column of a DataFrame can hold items of the same data type or dtype. Every DataFrame is made up of Series objects, one per column. The dtypes that pandas uses are: float, int, bool, datetime, timedelta, category and object. Sometimes we'll want to convert types to make the data easier to work with. To see the types of each column of a DataFrame, we can use:

    ::

        print(df.dtypes)

The pandas function pd.to_numeric() lets us convert strings containing numerical values to integers or floats.

    ::

        fruit.price = pd.to_numeric(fruit.price)
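
As a small sketch of the conversion, the table below stores prices as text. The data is invented, and passing errors="coerce" is an optional choice made here so that an unparseable entry becomes NaN instead of raising an error.

    ::

        import pandas as pd

        # Hypothetical prices stored as text; "n/a" cannot be parsed.
        fruit = pd.DataFrame({
            "name": ["apple", "kiwi", "fig"],
            "price": ["1.50", "2", "n/a"],
        })

        # errors="coerce" turns unparseable entries into NaN.
        fruit["price"] = pd.to_numeric(fruit["price"], errors="coerce")

        print(fruit.dtypes)
        print(fruit)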



String Parsing 
----------------------------------


Remove or replace unwanted characters using **.replace()** and **regex**

  :: 
    
    students.score = students['score'].replace(r'[%,]', '', regex=True)
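
A short sketch of the same idea on invented data: the unwanted characters are removed first, and the cleaned strings are then converted to numbers.

    ::

        import pandas as pd

        # Hypothetical scores stored as text with stray symbols.
        students = pd.DataFrame({"score": ["85%", "1,200", "92%"]})

        # Remove every percent sign and comma.
        students["score"] = students["score"].replace(r"[%,]", "", regex=True)

        # The remaining strings are plain digits, so they convert cleanly.
        students["score"] = pd.to_numeric(students["score"])

        print(students)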



 
Duplicate rows can be removed using **.drop_duplicates()**. To remove only rows that are duplicated in certain columns, we can pass the subset argument to specify which columns to look at.

  :: 
    
    df = df.drop_duplicates()
    df = df.drop_duplicates(subset=['reps'])
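
The difference between the two calls can be seen on a tiny table. This is a sketch with an invented workout log.

    ::

        import pandas as pd

        # Hypothetical workout log (values are invented).
        df = pd.DataFrame({
            "exercise": ["squat", "squat", "lunge", "plank"],
            "reps": [10, 10, 10, 30],
        })

        # Drop rows that repeat an earlier row in every column.
        full_dedup = df.drop_duplicates()

        # Drop rows that repeat an earlier value in "reps" only.
        reps_dedup = df.drop_duplicates(subset=["reps"])

        print(full_dedup)
        print(reps_dedup)

With subset=["reps"], the lunge row is also dropped because its reps value already appeared, even though the exercise is different.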


Missing Values
=================


Pandas provides the **isna()** and **notna()** functions to make detecting missing values easier. Some calculations will just skip NaN values, but others will break when NaN is encountered.

Using **.dropna()** returns the DataFrame without the incomplete rows. We can also pass the subset argument if we want to remove only rows containing NaN in specific columns.

We can use **.fillna()** to fill in missing values with the mean of the column or with some other aggregate value.

  :: 
  
    df['column_name'] = df['column_name'].fillna(value_to_fill_in)
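
The missing-value tools above could be combined as in this sketch. The table and its column names are hypothetical.

    ::

        import numpy as np
        import pandas as pd

        # Hypothetical measurements with gaps (values are invented).
        df = pd.DataFrame({
            "name": ["a", "b", "c", "d"],
            "height": [1.0, np.nan, 3.0, np.nan],
            "weight": [10.0, 20.0, np.nan, 40.0],
        })

        # Count the missing values in each column.
        missing_per_column = df.isna().sum()

        # Keep only rows with no missing values at all.
        complete_rows = df.dropna()

        # Keep rows that have a weight, whatever the other columns hold.
        has_weight = df.dropna(subset=["weight"])

        # Fill the gaps in one column with that column's mean.
        df["height"] = df["height"].fillna(df["height"].mean())

        print(missing_per_column)
        print(df)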
    

Some helpful Python string methods
=====================================

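The Python string method .replace() returns a new string with the replacements made. Strings are immutable, so the result must be saved to a variable (a new one or the original name) for the change to "stick".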

    ::
        
        updated_medical_data = medical_data.replace("#", "$")
    
The Python string method .split() breaks a string on a separator and returns a list of the pieces.

    ::
        
        medical_data_split = updated_medical_data.split(";")
    

The Python .strip() method removes whitespace (spaces, tabs and newlines) at the beginning and at the end of the string.
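
The three string methods can be chained on a single record, as in this sketch. The record itself is invented for illustration.

    ::

        # Hypothetical raw record. Strings are immutable, so each method
        # returns a new value that has to be stored in a variable.
        medical_data = "  Ana Diaz;34;#120.50 \n"

        # Swap the symbol; the original string is left unchanged.
        updated_medical_data = medical_data.replace("#", "$")

        # Break the record into fields on the semicolon.
        medical_data_split = updated_medical_data.split(";")

        # Trim the whitespace around every field.
        cleaned_fields = [field.strip() for field in medical_data_split]

        print(cleaned_fields)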
