These are some notes gathered from various Codecademy courses and a bit of Googling.

# Data Wrangling & Tidying


The process of cleaning up and preparing data for analysis is often called "data wrangling" or "data munging." The goal is to produce a "tidy" dataset.

A dataset is considered tidy if:

1. Each variable forms a column, and every value in that column uses the same unit of measurement
2. Each observation forms a row
3. Each type of **observational unit** forms a table

An **observational unit** is the individual "thing" we capture information about. For example, in a study about residential real estate sales, the observational unit might be each home.

A dataset is considered **wide-form** when a variable is spread across multiple columns as column headers, rather than consolidated into a single column. 
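
To make the wide-form definition concrete, here is a small sketch using an invented exam-score table (the names, columns, and values are assumptions for illustration only). In the wide version the variable "exam" lives in the column headers; in the tidy version it has its own column.

```python
import pandas as pd

# Hypothetical wide-form table: the exam number is spread across the headers.
wide = pd.DataFrame({
    "student": ["Ana", "Ben"],
    "exam_1": [88, 92],
    "exam_2": [79, 85],
})

# The same information in tidy form: one column per variable,
# one row per observation (a student's score on one exam).
tidy = pd.DataFrame({
    "student": ["Ana", "Ana", "Ben", "Ben"],
    "exam": [1, 2, 1, 2],
    "score": [88, 79, 92, 85],
})

print(wide.shape)  # (2, 3)
print(tidy.shape)  # (4, 3)
```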


## Why pandas?


Pandas' ability to manipulate large amounts of structured data makes it a powerful data science tool. It provides a number of functions that make it easy to explore a dataset and evaluate its overall tidiness.

Evaluating tidiness
------------------------------------

Some quick ways to evaluate tidiness using Pandas:

* .head() - displays first few rows, 5 by default, or another number if specified
* .info() - displays a summary of the table
* .describe() - displays summary statistics for the table
* .nunique() - gives the number of unique values in each column
* .unique() - returns the unique values in a column as an array
* .value_counts() - displays the number of times each value occurs
* .dtypes - lists the data type of each column - float64, object, int64, bool, datetime64
* .shape - identifies the number of rows and columns as (rows, columns)
* .columns - displays column names 


To see some examples of these functions at work, we'll begin by importing pandas. We can then create a dataframe from our fake census dataset csv file. 

In [1]:
import pandas as pd
filename = "C:/Users/Julie/git/portfolio/data/fake_census_data.csv"
census = pd.read_csv(filename, index_col=0)

In [2]:
census.head()

# returns the first few rows of data, making it easy to quickly assess the variable types. 
# In the example, we see that the values in first_name are categories that do not contain an order or ranking, while birth_year contains numeric values that must be expressed in whole integers.

Unnamed: 0,first_name,last_name,birth_year,voted,num_children,income_year,higher_tax,marital_status
0,Denise,Ratke,2005,False,0,92129.41,disagree,single
1,Hali,Cummerata,1987,False,0,75649.17,neutral,DIVORCED
2,Salomon,Orn,1992,True,2,166313.45,agree,Single
3,Sarina,Schiller,1965,False,2,71704.81,strongly agree,married
4,Gust,Abernathy,1945,False,2,143316.08,agree,married


In [3]:
census.shape

(100, 8)

In [4]:
census.columns

Index(['first_name', 'last_name', 'birth_year', 'voted', 'num_children',
       'income_year', 'higher_tax', 'marital_status'],
      dtype='object')

Using **.info()** or **.dtypes** we can see that the datatype for birth_year doesn't make sense. It's listed as an object, but should really be an integer. 

In [5]:
census.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 0 to 99
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   first_name      100 non-null    object 
 1   last_name       100 non-null    object 
 2   birth_year      100 non-null    object 
 3   voted           100 non-null    bool   
 4   num_children    100 non-null    int64  
 5   income_year     100 non-null    float64
 6   higher_tax      100 non-null    object 
 7   marital_status  100 non-null    object 
dtypes: bool(1), float64(1), int64(1), object(5)
memory usage: 6.3+ KB


In [7]:
census.dtypes

first_name         object
last_name          object
birth_year         object
voted                bool
num_children        int64
income_year       float64
higher_tax         object
marital_status     object
dtype: object

In [6]:
census.describe()

Unnamed: 0,num_children,income_year
count,100.0,100.0
mean,1.81,111380.7897
std,1.433333,49015.171775
min,0.0,35635.14
25%,0.75,71246.52
50%,2.0,104990.805
75%,3.0,153492.09
max,4.0,198123.77


Using **.unique()** to investigate further, we see that one of the birth years is listed as 'missing'. 

In [8]:
census['birth_year'].nunique()

53

In [9]:
census['birth_year'].unique()

array(['2005', '1987', '1992', '1965', '1945', '1951', '1963', '1949',
       '1950', '1971', '2007', '1944', '1995', '1973', '1946', '1954',
       '1994', '1989', '1947', '1993', '1976', '1984', 'missing', '1966',
       '1941', '2000', '1953', '1956', '1960', '2001', '1980', '1955',
       '1985', '1996', '1968', '1979', '2006', '1962', '1981', '1959',
       '1977', '1978', '1983', '1957', '1961', '1982', '2002', '1998',
       '1999', '1952', '1940', '1986', '1958'], dtype=object)

In [10]:
census.birth_year.value_counts()

1961       4
1949       4
2005       3
2007       3
1985       3
1989       3
2006       3
1954       3
1946       3
1973       3
1995       3
1971       3
1978       3
1963       3
1951       3
1992       3
1966       3
1962       2
1941       2
1998       2
1955       2
2001       2
1960       2
1987       2
1953       2
1984       2
1945       2
1994       2
2000       1
1977       1
1986       1
1940       1
1952       1
1999       1
1965       1
2002       1
1982       1
1957       1
1983       1
1950       1
1959       1
missing    1
1981       1
1944       1
1979       1
1968       1
1996       1
1947       1
1993       1
1980       1
1976       1
1956       1
1958       1
Name: birth_year, dtype: int64

Basic cleanup
------------------------------------

Some useful functions for cleaning data:

* .replace() - substitutes one value (or regex pattern) for another
* .rename() - relabels columns or rows using a dictionary; axis=1 refers to the columns, axis=0 to the rows
* .astype() - casts a column to a given data type
* .drop_duplicates() - removes duplicate rows
* .str.lstrip() - removes leading characters (such as a prefix symbol) from each string in a column
* .str.strip() - removes leading and trailing whitespace from each string in a column
* .isna() - returns a boolean for each value, indicating whether it is missing (True) or not (False)
* map() - can be used with str.lower to make column names consistent: restaurants.columns = map(str.lower, restaurants.columns)
* pd.crosstab() - computes the frequency of two or more variables. We can use it with .isna() to identify where NaN values occur in a column.
* .str.lower() - converts each string in a column to lowercase
* pd.to_numeric() - lets us convert strings containing numerical values to integers or floats
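
A minimal sketch of several of these cleanup functions working together. The `restaurants` DataFrame, its column names, and its values are invented for illustration and are not part of the census data used in these notes.

```python
import pandas as pd

# Hypothetical messy data, invented for this sketch.
restaurants = pd.DataFrame({
    "Name": ["  Luigi's ", "Taco Hut", "Taco Hut"],
    "Rating": ["4.5", "3.0", "3.0"],
    "City": ["Rome", None, None],
})

# Lowercase every column name for consistency.
restaurants.columns = list(map(str.lower, restaurants.columns))

# Rename one column; axis=1 targets the column labels.
restaurants = restaurants.rename({"rating": "stars"}, axis=1)

# Strip leading and trailing whitespace from a text column.
restaurants["name"] = restaurants["name"].str.strip()

# Convert numeric strings to floats.
restaurants["stars"] = pd.to_numeric(restaurants["stars"])

# Drop rows that exactly repeat an earlier row.
restaurants = restaurants.drop_duplicates()

# Cross-tabulate each name against whether its city is missing.
missing_by_name = pd.crosstab(restaurants["name"], restaurants["city"].isna())

print(restaurants)
print(missing_by_name)
```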

We begin basic cleanup by using **.replace()** to replace the value "missing" with 1967. To check the results, we call .unique() again.

In [11]:
census.birth_year = census.birth_year.replace('missing', 1967)
census['birth_year'].unique()

array(['2005', '1987', '1992', '1965', '1945', '1951', '1963', '1949',
       '1950', '1971', '2007', '1944', '1995', '1973', '1946', '1954',
       '1994', '1989', '1947', '1993', '1976', '1984', 1967, '1966',
       '1941', '2000', '1953', '1956', '1960', '2001', '1980', '1955',
       '1985', '1996', '1968', '1979', '2006', '1962', '1981', '1959',
       '1977', '1978', '1983', '1957', '1961', '1982', '2002', '1998',
       '1999', '1952', '1940', '1986', '1958'], dtype=object)

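Now that the missing value has been fixed, we can use **.astype()** or **pd.to_numeric()** to change birth_year to the correct data type, integer. The two give different widths in the output below: pd.to_numeric() infers int64, while .astype(int) produces int32, most likely because the plain Python int maps to a 32-bit integer on Windows with NumPy versions before 2.0.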

In [12]:
census.birth_year = pd.to_numeric(census.birth_year)
census.dtypes

first_name         object
last_name          object
birth_year          int64
voted                bool
num_children        int64
income_year       float64
higher_tax         object
marital_status     object
dtype: object

In [13]:
census.birth_year = census.birth_year.astype(int)
census.dtypes

first_name         object
last_name          object
birth_year          int32
voted                bool
num_children        int64
income_year       float64
higher_tax         object
marital_status     object
dtype: object
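
If a fixed integer width is wanted regardless of platform, one option is to name the width explicitly in `.astype()`. The sketch below uses a small stand-in Series rather than the census file, so the values are assumptions for illustration.

```python
import pandas as pd

# Stand-in for the birth_year column: strings plus the integer we substituted.
birth_year = pd.Series(["2005", "1987", 1967])

# Naming the width avoids depending on what plain `int` means on this platform.
as_int64 = birth_year.astype("int64")

print(as_int64.dtype)  # int64
```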

We want all of the values in marital_status to be lowercase strings, so we use **str.lower** to fix this.

In [14]:
census['marital_status'] = census.marital_status.apply(str.lower)
census.marital_status

0       single
1     divorced
2       single
3      married
4      married
        ...   
95     married
96      single
97      single
98     widowed
99      single
Name: marital_status, Length: 100, dtype: object
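
The `.apply(str.lower)` call above works because every value in the column is a string; it would raise a TypeError on a missing value. A possible alternative is the vectorized `.str.lower()` accessor, which leaves missing values in place. A minimal sketch with an invented Series:

```python
import pandas as pd

# Hypothetical column with inconsistent casing and one missing value.
status = pd.Series(["single", "DIVORCED", "Single", None])

# .str.lower() lowercases the strings and passes the missing value through.
lowered = status.str.lower()

print(lowered.iloc[:3].tolist())  # ['single', 'divorced', 'single']
```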

In [16]:
census.dtypes

first_name         object
last_name          object
birth_year          int32
voted                bool
num_children        int64
income_year       float64
higher_tax         object
marital_status     object
dtype: object

Reshaping Datasets
--------------------------

* .pivot() - allows us to reshape a dataset based on the values of a column
* .melt() - helps convert a dataset from wide to long-form.
* .reset_index() - clears up some of the indexing carried over from the original form of the data set 

**.pivot()** lets us reshape datasets that are too long. A dataset is considered “too long” when a single column in the dataset represents more than one variable, thus creating unnecessary extra rows. Another way to think about it is that all numbers in a single column should have the same unit.

Syntax:
df.pivot(index = , columns = , values = ).reset_index()

Parameters: 

* index: name of the column to make the new data frame's index 
* columns: name of the column to make the new DF's column headers 
* values: name of the column that will populate the new data frame's values 
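
A minimal sketch of these three parameters in use, assuming an invented long-form table in which the `metric` column mixes two variables with different units. The `.reset_index()` call and the `columns.name = None` assignment tidy up the index labels that the pivot leaves behind.

```python
import pandas as pd

# Hypothetical "too long" data: "value" mixes centimetres and kilograms.
long_df = pd.DataFrame({
    "name": ["Ana", "Ana", "Ben", "Ben"],
    "metric": ["height_cm", "weight_kg", "height_cm", "weight_kg"],
    "value": [165, 60, 180, 82],
})

# One row per name, one column per metric.
wide_df = long_df.pivot(index="name", columns="metric", values="value").reset_index()

# Remove the leftover "metric" label on the column axis.
wide_df.columns.name = None

print(wide_df)
```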

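**.melt()** is useful for converting a dataset from wide to long form: it gathers a set of column headers into a single variable column and their cell values into a single value column.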
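
As a rough sketch of the wide-to-long conversion that `.melt()` performs, the example below uses an invented housing table whose year values sit in the column headers. The column names passed to `var_name` and `value_name` are assumptions chosen for this example.

```python
import pandas as pd

# Hypothetical wide-form data: each year is its own column.
wide = pd.DataFrame({
    "city": ["Springfield", "Shelbyville"],
    "2019": [250000, 180000],
    "2020": [265000, 190000],
})

# Keep "city" as an identifier; fold the year columns into two new columns.
long_df = wide.melt(id_vars="city", var_name="year", value_name="median_price")

# The former headers are strings, so convert them to integers.
long_df["year"] = long_df["year"].astype("int64")

print(long_df)
```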

Using the function .reset_index() and specifying .columns.name = None clears up some of the indexing that was carried over from the original form of the data set.

Dealing with Multiple Files
-------------------------------

We can combine **glob**, a module in Python's standard library for finding files by name, with pandas to better organize data that is separated into multiple files. glob matches filenames using shell-style wildcards (such as * and ?) rather than full regular expressions, and returns a list of the matching paths.

The below code goes through any file that starts with 'file' and has an extension of .csv. It opens each file, reads the data into a DataFrame, and then concatenates all of those DataFrames together.

    ::

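        import glob

        import pandas as pd

        # Collect every path matching the wildcard pattern
        files = glob.glob("file*.csv")

        # Read each file into its own DataFrame
        df_list = []
        for filename in files:
            data = pd.read_csv(filename)
            df_list.append(data)

        # Stack the DataFrames into a single DataFrame
        df = pd.concat(df_list)

        print(files)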


Regular Expressions
---------------------------

**Regular expressions** are special sequences of characters that describe a pattern of text that needs to be found. They operate by moving through a piece of text, character by character, from left to right. When a character is found that matches the first piece of the expression, it looks to find a continuous sequence of matching characters.

* **Literals** are regular expressions that contain the exact text we want to match. 
* **Alternation** allows us to match either the text preceding or the text following the pipe symbol \|.
     
        apples|pears

* **Character sets** are denoted by brackets [ ] and let us match one character from a series of characters. The letters inside are the different possibilities for the character that appears in that position. We can also place the caret symbol **^** as the first character inside the brackets to negate our character set, matching any character *not* listed in the brackets. 

        [cat] 
        # will match characters c, a, or t, but not the text cat 
        
        [^cat] 
        # will match any character that is *not* c, a, or t
  
* **Grouping** or **capture groups** lets us group parts of a regular expression together and limit the reach of the \| to the text within ():

        I love (baboons|gorillas)
        # will match "I love" and then either "baboons" or "gorillas"

* **Wildcards** are represented by a period or dot . and will match any single character (letter, number, symbol or whitespace). We can use the escape character \ when we want to match an actual period. 

        .........
        # will completely match any 9-character piece of text! 

        Howler monkeys are really lazy\.
        # will completely match the text "Howler monkeys are really lazy."

* **Ranges** allow us to specify a range of characters in which we can make a match, without having to type out each one. The - indicates that we are matching a range. 
        
        [a-d]
        # would match any single character a, b, c, or d
        # the above is equivalent to [abcd]

A range can be used to match any single:

* capital letter [A-Z]
* lowercase letter [a-z]
* digit [0-9]

We can also match multiple ranges in the same set.

        [A-Za-z]
        # would let us match any single capital or lowercase alphabetical character
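
The patterns covered so far can be tried out with Python's built-in `re` module. This is only a sketch: the test strings are invented, and `re.findall` and `re.fullmatch` are just two of several ways to apply a pattern.

```python
import re

# Alternation: match either word.
fruit = re.findall(r"apples|pears", "I like apples and pears")
print(fruit)  # ['apples', 'pears']

# Character set and its negation: single characters, not the text "cat".
in_set = re.findall(r"[cat]", "cot")
not_in_set = re.findall(r"[^cat]", "cot")
print(in_set, not_in_set)  # ['c', 't'] ['o']

# Grouping limits the alternation to the text inside the parentheses.
love = re.fullmatch(r"I love (baboons|gorillas)", "I love gorillas")
print(love.group(1))  # gorillas

# Wildcards: nine dots completely match any 9-character string.
nine = re.fullmatch(r".........", "123 abc!?") is not None
# An escaped dot matches only a literal period.
lazy = re.fullmatch(r"Howler monkeys are really lazy\.",
                    "Howler monkeys are really lazy.") is not None
print(nine, lazy)  # True True

# Ranges, alone and combined in one set.
a_to_d = re.findall(r"[a-d]", "abcdef")
letters = re.findall(r"[A-Za-z]", "R2-D2")
print(a_to_d, letters)  # ['a', 'b', 'c', 'd'] ['R', 'D']
```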

* **Shorthand character classes** make writing regular expressions much simpler by representing common ranges. Examples include: \w word characters, \d digit characters, \s whitespace characters  
* **Groupings**, denoted with parentheses (), group parts of a regular expression together and allow us to limit alternation to part of a regex
* **Fixed quantifiers**, represented with curly braces {}, let us indicate the exact quantity or a range of quantity of a character we wish to match. Note - quantifiers are greedy - they will match the **greatest quantity** of characters they possibly can and ignore smaller matches.
* **Optional quantifiers**, indicated by the question mark ?, allow us to indicate a character in a regex is optional, meaning it can appear either 0 or 1 time. They only apply to the character directly before the *?* (Note - since ? is a metacharacter, we need to use the escape character \ in our regex in order to match a question mark ? in a piece of text.)
* **The Kleene star** is also a quantifier and matches the preceding character 0 or more times, meaning the character doesn't need to appear, can appear once, or can appear many times. It is denoted with the asterisk *.

        meo*w 
        # will match "me", followed by 0 or more "o", followed by "w". 
        # Thus the regex will match "mew", "meow", "meooow", and "meoooooooooooow".

* **The Kleene plus**, denoted by the plus +, matches the preceding character 1 or more times.

        meo+w 
        # will match the characters "me", followed by 1 or more "o", followed by a "w". 
        # Thus the regex will match "meow", "meooow", and "meoooooooooooow", but not match mew.

Examples of fixed and optional quantifiers:

        \w{3} 
        # will match exactly 3 word characters

        \w{4,7} 
        # will match at minimum 4 and at maximum 7 word characters

        roa{3}r 
        # will match the characters "ro" followed by 3 "a", and then the character "r"

        mo{2,4} 
        # quantifiers are greedy - they will match the text "moooo" in the string "moooo", but not return a match of "moo" or "mooo"
        
        humou?r 
        # matches the characters "humo" then either 0 or 1 occurrence of "u" and finally "r" .

* **Anchors** ensure that we do not match unintended text by making the expression as specific as possible. The anchors hat ^ and dollar sign $ are used to match text at the start and the end of a string, respectively. 

        ^Monkeys: my mortal enemy$ 
        # ^ ensures that the matched text begins with "Monkeys", the $ ensures the matched text ends with "enemy".
        # will completely match "Monkeys: my mortal enemy" 
        # will not match "Spider Monkeys: my mortal enemy in the wild" or "Squirrel Monkeys: my mortal enemy in the wild" 
        # Without the anchors, the regex Monkeys: my mortal enemy will match the text "Monkeys: my mortal enemy" in both "Spider Monkeys: my mortal enemy in the wild" and "Squirrel Monkeys: my mortal enemy in the wild".
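
The quantifier and anchor examples above can be checked with Python's built-in `re` module. This is a sketch using the same patterns; the helper list `words` is an assumption added for illustration.

```python
import re

# Fixed quantifier with a shorthand class: exactly 3 word characters.
three_ok = re.fullmatch(r"\w{3}", "abc") is not None
three_bad = re.fullmatch(r"\w{3}", "abcd") is not None
print(three_ok, three_bad)  # True False

# Greedy range quantifier: takes all four "o" characters it is allowed.
greedy = re.search(r"mo{2,4}", "moooo").group()
print(greedy)  # moooo

# Optional quantifier: the "u" may appear 0 or 1 time.
humor = [re.fullmatch(r"humou?r", w) is not None
         for w in ("humor", "humour", "humouur")]
print(humor)  # [True, True, False]

# Kleene star (0 or more) versus Kleene plus (1 or more).
words = ["mew", "meow", "meooow"]
star = [w for w in words if re.fullmatch(r"meo*w", w)]
plus = [w for w in words if re.fullmatch(r"meo+w", w)]
print(star)  # ['mew', 'meow', 'meooow']
print(plus)  # ['meow', 'meooow']

# Anchors: the pattern must span the whole string.
anchored = r"^Monkeys: my mortal enemy$"
exact = re.search(anchored, "Monkeys: my mortal enemy") is not None
longer = re.search(anchored, "Spider Monkeys: my mortal enemy in the wild") is not None
unanchored = re.search(r"Monkeys: my mortal enemy",
                       "Spider Monkeys: my mortal enemy in the wild") is not None
print(exact, longer, unanchored)  # True False True
```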




Splitting
---------------------------------

Splitting can be useful in cleaning up data with columns that contain more than one type of data. 

We can split by index when at least one of the parts is a set number of characters. We can also split by character, which is useful when the part to be split may vary in length but is always separated by a specific character.

Here's an example of how pandas' **.str.split()** function might be used with a **regex** to extract numbers contained within a string:

  :: 
    
    students['grade'].str.split(r'(\d+)', expand=True)[1]
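
A minimal sketch of the three kinds of split described above: by index, by character, and by regex. The `students` DataFrame, its columns, and the MMDDYYYY birthday format are assumptions invented for this example.

```python
import pandas as pd

# Hypothetical data, invented for this sketch.
students = pd.DataFrame({
    "birthday": ["01152004", "11302003"],  # assumed MMDDYYYY, fixed width
    "full_name": ["Ana Lopez", "Ben Okafor"],
    "grade": ["9th grade", "12th grade"],
})

# Split by index: each part sits at a known character position.
students["month"] = students["birthday"].str[0:2]
students["year"] = students["birthday"].str[4:]

# Split by character: the parts vary in length but share a separator.
name_parts = students["full_name"].str.split(" ", expand=True)
students["first_name"] = name_parts[0]
students["last_name"] = name_parts[1]

# Split on a regex with a capture group: the captured digits
# are kept and land in column 1 of the expanded result.
students["grade_num"] = students["grade"].str.split(r"(\d+)", expand=True)[1]

print(students[["month", "year", "first_name", "last_name", "grade_num"]])
```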

String Parsing 
----------------------------------


Remove or replace unwanted characters using **.replace()** and **regex**

  :: 
    
    students.score = students['score'].replace(r'[%,]', '', regex=True)
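
A short sketch of the same pattern applied end to end, assuming an invented `score` column that holds percentages as strings. After the unwanted characters are removed, `pd.to_numeric` converts what remains to numbers.

```python
import pandas as pd

# Hypothetical scores stored as strings with % signs and a thousands comma.
students = pd.DataFrame({"score": ["85%", "1,020%", "73%"]})

# Remove every % and comma, then convert the cleaned strings to numbers.
students["score"] = students["score"].replace(r"[%,]", "", regex=True)
students["score"] = pd.to_numeric(students["score"])

print(students["score"].tolist())  # [85, 1020, 73]
```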



 
Duplicate rows can be removed using **.drop_duplicates()**. To remove rows that are duplicated only in certain columns, we can pass a subset specifying which columns to look at.

  :: 
    
    df = df.drop_duplicates()
    df = df.drop_duplicates(subset=['reps'])
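
To see the difference between the two calls, here is a small sketch with an invented workout table (the column names and values are assumptions). Without `subset`, a row is dropped only if every column matches an earlier row; with `subset`, only the listed columns are compared.

```python
import pandas as pd

# Hypothetical data: rows 0 and 1 are identical; rows 0-2 share the same reps.
df = pd.DataFrame({
    "exercise": ["squat", "squat", "lunge", "plank"],
    "reps": [10, 10, 10, 3],
})

# Compare all columns: only the repeated squat row is removed.
all_columns = df.drop_duplicates()

# Compare only "reps": the first row for each reps value is kept.
by_reps = df.drop_duplicates(subset=["reps"])

print(all_columns["exercise"].tolist())  # ['squat', 'lunge', 'plank']
print(by_reps["exercise"].tolist())      # ['squat', 'plank']
```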


Missing Values
=================


Pandas provides the **isna()** and **notna()** functions to make detecting missing values easier. Some calculations will just skip NaN values, but others will break when NaN is encountered.

Using **.dropna()** returns the DataFrame without the incomplete rows. We can also pass a subset if we want to remove only rows containing NaN in specific columns.

We can use **.fillna()** to fill in missing values with the mean of the column or with some other aggregate value.

  :: 
  
    df['column_name'] = df['column_name'].fillna(value_to_fill_in)
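
A minimal sketch of detecting, dropping, and filling missing values, using an invented three-row table (the column names and values are assumptions for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing score and one missing city.
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Cy"],
    "score": [90.0, np.nan, 70.0],
    "city": ["Rome", "Oslo", None],
})

# Boolean mask marking the missing scores.
missing_mask = df["score"].isna()
print(missing_mask.tolist())  # [False, True, False]

# Drop any row that has a missing value in any column.
complete_rows = df.dropna()

# Drop only the rows whose score is missing.
has_score = df.dropna(subset=["score"])

# Fill missing scores with the column mean (NaN is skipped by .mean()).
filled = df.copy()
filled["score"] = filled["score"].fillna(filled["score"].mean())
print(filled["score"].tolist())  # [90.0, 80.0, 70.0]
```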
    

Some helpful Python string methods
=====================================

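The Python string method .replace() returns a new string with the replacements made. Strings are immutable, so the result must be assigned to a variable (new or existing) for the change to "stick".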

    ::
        
        updated_medical_data = medical_data.replace("#", "$")
    
The Python string method .split() breaks a string into a list of substrings at each occurrence of a delimiter.

    ::
        
        medical_data_split = updated_medical_data.split(";")
    

The Python string method .strip() removes whitespace at the beginning and at the end of the string. 
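
A short sketch of these three string methods chained together. The contents of `medical_data` are invented for illustration; the point is that each method returns a new value and leaves the original string unchanged.

```python
# Hypothetical record string with padding, "#" markers, and ";" separators.
medical_data = "  Ana#120;Ben#95  "

# .replace() returns a new string; medical_data itself is not modified.
updated_medical_data = medical_data.replace("#", "$")

# .strip() trims the surrounding whitespace, then .split() returns a list.
medical_data_split = updated_medical_data.strip().split(";")

print(medical_data_split)   # ['Ana$120', 'Ben$95']
print("#" in medical_data)  # True
```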
