## Chapter 3 Notes

- Data wrangling, also known as data manipulation, is the process of taking raw data and turning it into something meaningful for analysis
  - Being clear about the steps taken during data wrangling helps prevent the data from being misrepresented
- Data wrangling can be broken down into the following:
  - Data Cleaning
    - Follows data importing
    - Includes renaming, sorting, filtering, converting data types, and addressing duplicates and nulls
  - Data Transformation
    - Conversion between wide and long form of data
    - Wide data is good for analysis and database design
      - Data is represented with measurements of variables as unique columns and one observation per row
    - Long data is good for plotting and for a fixed table schema
  - Data Enrichment
    - Adding to data to make it more meaningful
    - This includes adding new columns, binning, aggregation, and resampling
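
As a rough illustration of the enrichment steps listed above (new columns, binning, aggregation, resampling), here is a minimal sketch. The temperature values, column names, and bin edges are made up for the example and do not come from the chapter.

```python
import pandas as pd

# Hypothetical daily temperature readings (made-up values for illustration)
df = pd.DataFrame(
    {"temp_c": [1.0, 4.0, 12.0, 18.0, 25.0, 9.0]},
    index=pd.date_range("2024-01-01", periods=6, freq="D"),
)

# New column derived from an existing one
df["temp_f"] = df["temp_c"] * 9 / 5 + 32

# Binning: bucket a continuous column into labeled ranges (assumed bin edges)
df["band"] = pd.cut(
    df["temp_c"], bins=[-50, 5, 15, 50], labels=["cold", "mild", "warm"]
)

# Aggregation: one summary value per bin
mean_by_band = df.groupby("band", observed=True)["temp_c"].mean()

# Resampling: roll the daily rows up into 3-day averages
three_day = df["temp_c"].resample("3D").mean()
```

Resampling needs a datetime index, which is why the sketch builds one with `pd.date_range()` up front.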
- Learning how to connect to and explore an API is important
  - May need a token to access the API and knowledge of JSON formatting
  - Pg. 127 to 138 provides a detailed example of an API connection process
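
The general shape of a token-based API call and JSON exploration might look like the sketch below. This is not the book's example: the URL, the token value, the header name `token`, and the sample payload are all hypothetical placeholders, and the request is only built, not sent.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical values: replace with the real endpoint and your own token
BASE_URL = "https://api.example.com/v1/data"
TOKEN = "YOUR_TOKEN_HERE"


def build_request(endpoint_url, token, params):
    """Build (but do not send) a GET request that carries the API token."""
    url = f"{endpoint_url}?{urlencode(params)}"
    # Assumption: the token travels in a header named "token"; check the API docs
    return Request(url, headers={"token": token})


request = build_request(BASE_URL, TOKEN, {"limit": 2, "offset": 0})

# A made-up JSON payload standing in for the body of a real response
sample_body = (
    '{"metadata": {"count": 2}, '
    '"results": [{"id": "a", "value": 1.5}, {"id": "b", "value": 2.0}]}'
)
payload = json.loads(sample_body)

# Explore the structure before pulling out the records
top_level_keys = list(payload.keys())
records = payload["results"]
```

To actually send the request, `urllib.request.urlopen(request)` returns a response whose body can be parsed with `json.load()`. Each real API documents its own authentication scheme and payload layout.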
- Cleaning data
  - The `columns` attribute of dataframes can be used to explore the column headers
  - Columns can be renamed with the `rename()` method
    - Dictionary passed to `columns` to assign names
    - `inplace` can be passed
    - Series and Index objects can be renamed
    - String methods can also be passed to the `rename()` method
  - Data types can be explored using the `dtypes` attribute
  - Dates stored as strings (object dtype) can be converted to datetimes using the `pd.to_datetime()` function
    - `parse_dates` can be passed into a `read_csv()` call to parse dates while reading a .csv file: `parse_dates=True` tries to parse the index, while a list of column names parses those columns
    - Datetimes can be truncated using the `to_period()` method, which converts them to periods at a chosen frequency (year, month, day, etc.)
  - The `assign()` method can also be used to cast columns to a different data type
    - Lambda functions can be used in the `assign()` method and are useful in data type conversion and column generation
  - The `astype()` method can be used to change the data type of a single column
  - `pd.to_numeric()` will automatically convert numeric data to integer or float
  - The category data type can be assigned to a column if there are a limited number of distinct values
    - Pandas can pull more stats about categorical data
  - Rows and columns can be sorted using the `sort_values()` method
    - Row sorting can be performed with one or more columns
    - `nlargest()` and `nsmallest()` can be used to pull a subset of sorted data
    - Index values can be sorted with the `sort_index()` method; columns can also be sorted with this method by passing in `axis=1`
    - The `set_index()` method can be used to assign a column as the index, e.g. a column of dates
      - Slicing and indexing can be used with a datetime index
    - The `reset_index()` method resets the index to the default integer range and, by default, moves the old index back into a column
    - The `reindex()` method can be used to align a dataset with an existing index
      - Pg. 152 to 160 has an example of this with stock and bitcoin trade value analysis
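
The cleaning steps above can be chained together. The following is a minimal sketch on made-up data (the column names and values are assumptions for illustration, not the book's dataset) covering renaming, type conversion, sorting, indexing, and reindexing.

```python
import pandas as pd

# Made-up data for illustration
raw = pd.DataFrame(
    {
        "Date": ["2024-01-03", "2024-01-01", "2024-01-02"],
        "Temp C": ["7.5", "3.0", "5.5"],
        "Station": ["A", "B", "A"],
    }
)

clean = (
    raw.rename(columns={"Temp C": "temp_c"})  # dictionary maps old name -> new name
    .rename(columns=str.lower)  # a string method applied to every header
    .assign(
        date=lambda d: pd.to_datetime(d["date"]),  # string -> datetime64
        temp_c=lambda d: d["temp_c"].astype(float),  # string -> float
        station=lambda d: d["station"].astype("category"),  # few distinct values
    )
    .sort_values("date")
    .set_index("date")
)

# Slicing by date label works once the index is a sorted DatetimeIndex
first_two_days = clean.loc["2024-01-01":"2024-01-02"]

# nlargest() pulls the top rows without sorting the whole frame by hand
warmest = clean.nlargest(1, "temp_c")

# reindex() aligns the rows to another index; dates with no data become NaN
full_range = pd.date_range("2024-01-01", "2024-01-04", freq="D")
aligned = clean.reindex(full_range)

# to_period() truncates each timestamp to the given frequency (monthly here)
months = clean.index.to_period("M")
```

Because `assign()` and `rename()` return new dataframes, `raw` is left untouched, which makes each step easy to inspect on its own.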
- Reshaping data
  - The `T` attribute (or the `transpose()` method) can be used to transpose a dataframe
  - The `pivot()` method is used to switch data from long to wide format
    - The column whose values become the new column names is specified with `columns`, and the column that supplies the values for those columns is specified with `values`
    - A new index can be set with the `index` argument
    - A hierarchical index can be set in columns as well
    - When trying to transform a long dataset with a multi-level index, use the `unstack()` method instead, which also allows you to specify how to fill missing values through its `fill_value` argument
  - The `melt()` method is used to switch from wide to long format
    - The `melt()` method takes an `id_vars` argument identifying the column(s) in the wide dataset that uniquely identify a row and a `value_vars` argument identifying the columns in the wide dataset containing the values
    - `var_name` and `value_name` are optional arguments to name the variable column in the long format and the new column of values
    - The `stack()` method is another way to transform wide data to long, pivoting columns to the innermost level of the index
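
The reshaping methods above can be compared side by side. This is a minimal sketch on made-up data; the column names `date`, `datatype`, `value`, `measurement`, and `temp_c` are assumptions chosen for the example.

```python
import pandas as pd

# Made-up long-format data: one row per (date, measurement) pair
long_df = pd.DataFrame(
    {
        "date": ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02"],
        "datatype": ["TMAX", "TMIN", "TMAX", "TMIN"],
        "value": [10.0, 2.0, 12.0, 4.0],
    }
)

# Long -> wide: one column per distinct value of "datatype"
wide_df = long_df.pivot(index="date", columns="datatype", values="value")

# Wide -> long with melt(): id_vars stay put, value_vars are folded into rows
melted = wide_df.reset_index().melt(
    id_vars="date",
    value_vars=["TMAX", "TMIN"],
    var_name="measurement",
    value_name="temp_c",
)

# Wide -> long with stack(): columns move to the innermost index level
stacked = wide_df.stack()

# Long -> wide from a multi-level index with unstack(); the (2024-01-02, TMIN)
# row is dropped on purpose so fill_value has a gap to fill
multi = long_df.set_index(["date", "datatype"])["value"].iloc[:3]
unstacked = multi.unstack(fill_value=0.0)
```

`melt()` returns a flat dataframe with a default integer index, while `stack()` returns a Series keyed by a multi-level index, so the choice often depends on what the next step expects.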
- Handling duplicate, missing, or invalid data
  - Finding problematic data
    - The `describe()` method can be used to explore the data for NaN's, Inf's, and invalid data
    - The `info()` method can be used to check data types in the columns
    - `isnull()` and `isna()` can be used to find null and NaN values; they can also be used to create Boolean filters
      - `isna()` must be used instead of comparing against a NaN value or string, because NaN in Python is not equal to anything, including itself
      - Unlike NaN, inf and -inf are equal to `np.inf` and `-np.inf`, so they can be found with an equality comparison
  - Mitigating issues
    - While some data may seem problematic, one must take care not to replace or delete it before investigating further
      - Maybe the data present is a placeholder for a different value or the problematic data won't have an influence on the final results and can be left in
      - Consult documentation behind the data, if available
    - The `dropna()` method drops the rows with any null data by default
    - `fillna()` is available to replace NaN's with a specific value, for instance in the event that NaN's are representative of a different value
    - The `np.nan_to_num()` function can be used to replace infs with finite values (by default, NaN becomes 0 and inf/-inf become very large finite numbers)
    - Another approach to filling missing data is through imputation, or filling NaN's with a value derived from the other values in the column, like the mean or median
      - This will reduce the impact of the missing data
    - Finally, interpolation can be used to fill missing data through the `interpolate()` method, which can be applied to each column with `apply()`
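
The detection and mitigation options above can be sketched on a small made-up frame. The column names and the replacement values passed to `np.nan_to_num()` are assumptions for illustration; which fix is appropriate depends on what the missing values actually mean.

```python
import numpy as np
import pandas as pd

# Made-up data containing NaN and infinite values
df = pd.DataFrame(
    {
        "temp_c": [1.0, np.nan, 3.0, np.nan, 5.0],
        "snow_cm": [0.0, np.inf, 2.0, -np.inf, 4.0],
    }
)

# NaN never equals anything, so an equality check finds nothing
found_by_equality = int((df["temp_c"] == np.nan).sum())
found_by_isna = int(df["temp_c"].isna().sum())

# inf and -inf do compare equal, so a membership test locates them
inf_mask = df["snow_cm"].isin([np.inf, -np.inf])

# Drop every row that has at least one NaN (infs are not treated as null)
dropped = df.dropna()

# Fill with a fixed value, e.g. when NaN stands in for "none recorded"
filled_zero = df["temp_c"].fillna(0)

# Imputation: fill with a statistic derived from the column itself
imputed = df["temp_c"].fillna(df["temp_c"].median())

# Interpolation: fill each gap from its neighbors (linear by default)
interpolated = df["temp_c"].interpolate()

# Replace infs with chosen finite values (assumed placeholders;
# the posinf/neginf arguments need NumPy 1.17 or later)
snow_finite = np.nan_to_num(df["snow_cm"].to_numpy(), posinf=100.0, neginf=0.0)
```

Each call returns a new object, so the alternatives can be compared against the untouched `df` before committing to one.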