# Data Analytics (Part#3)

Table DA_31: Series attributes

|Attribute| Description|
|------|------|
| `loc` | Subset using index value |
| `iloc` | Subset using index position |
| `ix` | Subset using index value and/or position (removed in pandas 1.0; use `loc` or `iloc` instead) |
| `dtype` or `dtypes` | The type of the `Series` contents |
| `T` | Transpose of the series |
| `shape` | Dimensions of the data |
| `size` | Number of elements in the `Series` |
| `values` | `ndarray` or `ndarray`-like of the `Series` |
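
The attributes in Table DA_31 can be tried on a small `Series`. The sketch below uses made-up prices and stock codes (the variable names and values are illustrative, not taken from the dataset used later).

```python
import pandas as pd

# Hypothetical unit prices indexed by stock code (illustrative values only)
prices = pd.Series(
    [2.55, 3.39, 2.75],
    index=["85123A", "71053", "84406B"],
    name="UnitPrice",
)

by_label = prices.loc["71053"]   # subset using the index value
by_position = prices.iloc[0]     # subset using the index position

print(by_label, by_position)
print(prices.dtype)              # type of the Series contents
print(prices.shape, prices.size) # dimensions and number of elements
print(type(prices.values))       # underlying ndarray
```

`loc` looks up by label, while `iloc` counts positions from zero, so the two only agree when the index happens to be `0, 1, 2, ...`.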

Table DA_32: Series methods

|Method| Description|
|------|------|
| `append` | Concatenates two or more `Series` (removed in pandas 2.0; use `pd.concat` instead) |
| `corr` | Calculate a correlation with another `Series` |
| `cov` | Calculate a covariance with another `Series` |
| `describe` | Calculate summary statistics |
| `drop_duplicates` | Returns a `Series` without duplicates |
| `equals` | Determines whether a `Series` has the same elements |
| `get_values` | Get values of the `Series`; same as the `values` attribute (removed in pandas 1.0; use `to_numpy()` instead) |
| `hist` | Draw a histogram |
| `isin` | Checks whether values are contained in a `Series` |
| `min` | Returns the minimum value |
| `max` | Returns the maximum value |
| `mean` | Returns the arithmetic mean |
| `median` | Returns the median |
| `mode` | Returns the mode(s) |
| `quantile` | Returns the value at a given quantile |
| `replace` | Replaces values in the `Series` with a specified value |
| `sample` | Returns a random sample of values from the `Series` |
| `sort_values` | Sorts values |
| `to_frame` | Converts a `Series` to a `DataFrame` |
| `transpose` | Returns the transpose |
| `unique` | Returns a `numpy.ndarray` of unique values |
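
A few of the methods in Table DA_32 are sketched below on two short, made-up `Series` (the names `quantity` and `unit_price` and their values are assumptions for illustration). `pd.concat` is used here in place of `append`, which is not available in recent pandas versions.

```python
import pandas as pd

# Hypothetical values, chosen only to make the results easy to check by hand
quantity = pd.Series([6, 6, 8, 6, 2, 32], name="Quantity")
unit_price = pd.Series([2.55, 3.39, 2.75, 3.39, 7.65, 1.69], name="UnitPrice")

print(quantity.describe())                  # summary statistics

smallest = quantity.min()
largest = quantity.max()
middle = quantity.median()
distinct = quantity.unique()                # ndarray of unique values, in order of appearance
deduplicated = quantity.drop_duplicates()   # keeps the first occurrence of each value
is_bulk = quantity.isin([8, 32])            # element-wise membership test
ordered = quantity.sort_values()
correlation = quantity.corr(unit_price)     # Pearson correlation by default
frame = quantity.to_frame()                 # one-column DataFrame
combined = pd.concat([quantity, unit_price])
```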

**All the pandas `Series` attributes and methods can be found at the link$^1$.**

In [1]:
# Load the dataset
import pandas as pd
dataset_filename = "../Chapter 3: Control Statements/Online_Retail.xlsx"
df_transaction = pd.read_excel(dataset_filename)  # Read the Excel file
df_transaction.head(15)  # Display the first 15 records

| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|------|------|------|------|------|------|------|------|------|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 5 | 536365 | 22752 | SET 7 BABUSHKA NESTING BOXES | 2 | 2010-12-01 08:26:00 | 7.65 | 17850.0 | United Kingdom |
| 6 | 536365 | 21730 | GLASS STAR FROSTED T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 4.25 | 17850.0 | United Kingdom |
| 7 | 536366 | 22633 | HAND WARMER UNION JACK | 6 | 2010-12-01 08:28:00 | 1.85 | 17850.0 | United Kingdom |
| 8 | 536366 | 22632 | HAND WARMER RED POLKA DOT | 6 | 2010-12-01 08:28:00 | 1.85 | 17850.0 | United Kingdom |
| 9 | 536367 | 84879 | ASSORTED COLOUR BIRD ORNAMENT | 32 | 2010-12-01 08:34:00 | 1.69 | 13047.0 | United Kingdom |


In [2]:
print(df_transaction.dtypes)
df_transaction.describe()

InvoiceNo              object
StockCode              object
Description            object
Quantity                int64
InvoiceDate    datetime64[ns]
UnitPrice             float64
CustomerID            float64
Country                object
dtype: object


| | Quantity | UnitPrice | CustomerID |
|------|------|------|------|
| count | 1999.0 | 1999.0 | 1466.0 |
| mean | 9.204102 | 3.7995 | 15617.886085 |
| std | 28.295901 | 13.684213 | 1868.749126 |
| min | -24.0 | 0.0 | 12431.0 |
| 25% | 1.0 | 1.45 | 14307.0 |
| 50% | 3.0 | 2.51 | 15525.0 |
| 75% | 8.0 | 4.21 | 17850.0 |
| max | 600.0 | 569.77 | 18144.0 |


In [3]:
df_transaction['UnitPrice'].mean()

3.7994997498749234

In [4]:
df_transaction_2 = df_transaction.loc[5:18, ['Quantity','UnitPrice']]
df_transaction_2

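| | Quantity | UnitPrice |
|------|------|------|
| 5 | 2 | 7.65 |
| 6 | 6 | 4.25 |
| 7 | 6 | 1.85 |
| 8 | 6 | 1.85 |
| 9 | 32 | 1.69 |
| 10 | 6 | 2.1 |
| 11 | 6 | 2.1 |
| 12 | 8 | 3.75 |
| 13 | 6 | 1.65 |
| 14 | 6 | 4.25 |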


**Operations on Series and DataFrames are vectorized**

In [5]:
print(df_transaction_2['UnitPrice'] > df_transaction_2['UnitPrice'].mean())
print(df_transaction_2['UnitPrice'] + df_transaction_2['UnitPrice'])
print(df_transaction_2['Quantity'] + 3)  # scalar operation

5      True
6      True
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14     True
15     True
16     True
17     True
18     True
Name: UnitPrice, dtype: bool
5     15.30
6      8.50
7      3.70
8      3.70
9      3.38
10     4.20
11     4.20
12     7.50
13     3.30
14     8.50
15     9.90
16    19.90
17    11.90
18    11.90
Name: UnitPrice, dtype: float64
5      5
6      9
7      9
8      9
9     35
10     9
11     9
12    11
13     9
14     9
15     6
16     5
17     6
18     6
Name: Quantity, dtype: int64


In [6]:
df_transaction[df_transaction['UnitPrice'] > 30]

| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|------|------|------|------|------|------|------|------|------|
| 246 | 536392 | 22827 | RUSTIC SEVENTEEN DRAWER SIDEBOARD | 1 | 2010-12-01 10:29:00 | 165.0 | 13705.0 | United Kingdom |
| 294 | 536396 | 22803 | IVORY EMBROIDERED QUILT | 2 | 2010-12-01 10:51:00 | 35.75 | 17850.0 | United Kingdom |
| 431 | 536406 | 22803 | IVORY EMBROIDERED QUILT | 2 | 2010-12-01 11:33:00 | 35.75 | 17850.0 | United Kingdom |
| 1423 | 536540 | C2 | CARRIAGE | 1 | 2010-12-01 14:05:00 | 50.0 | 14911.0 | EIRE |
| 1665 | 536544 | 22769 | CHALKBOARD KITCHEN ORGANISER | 1 | 2010-12-01 14:32:00 | 51.02 | NaN | United Kingdom |
| 1677 | 536544 | 22847 | BREAD BIN DINER STYLE IVORY | 1 | 2010-12-01 14:32:00 | 34.0 | NaN | United Kingdom |
| 1814 | 536544 | DOT | DOTCOM POSTAGE | 1 | 2010-12-01 14:32:00 | 569.77 | NaN | United Kingdom |
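
The row filter above works because a comparison on a `Series` yields a boolean `Series` that can be used as a mask. A minimal sketch on a small, made-up frame (the values are illustrative) shows the mask itself, how two conditions might be combined, and an element-wise product of two columns:

```python
import pandas as pd

# Hypothetical rows, not taken from Online_Retail.xlsx
df = pd.DataFrame({
    "Quantity": [6, 1, 2, 32],
    "UnitPrice": [2.55, 165.0, 35.75, 1.69],
})

mask = df["UnitPrice"] > 30      # boolean Series, one entry per row
expensive = df[mask]             # keeps the rows where the mask is True

# Combine conditions with & (and) or | (or); each condition needs parentheses
expensive_multi = df[(df["UnitPrice"] > 30) & (df["Quantity"] > 1)]

# Arithmetic between two columns is applied row by row
line_total = df["Quantity"] * df["UnitPrice"]
print(line_total)
```

The Python keywords `and` and `or` do not work element-wise on a `Series`, which is why `&` and `|` are used here.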


### Add Columns
New columns can be added to an existing DataFrame. The `InvoiceNo` column is stored as strings (dtype `object`), although invoice numbers should be integers. We'll add a new column, `Invoice_No`, that holds the same values as `InvoiceNo` converted to an integer type.

In [7]:
print(df_transaction['InvoiceNo'].dtype)  # check data type of InvoiceNo
# Convert InvoiceNo to numeric.
# This raises a ValueError because some InvoiceNo values are alphanumeric.
Invoice_No = pd.to_numeric(df_transaction['InvoiceNo'])

object


ValueError: Unable to parse string "C536379" at position 141

In [None]:
# Inspect the record(s) with InvoiceNo C536379
df_transaction.loc[df_transaction['InvoiceNo'] == 'C536379']

**Filter out alphanumeric records**

In [None]:
# Keep only the rows whose InvoiceNo can be parsed as a number;
# errors='coerce' turns unparseable values into NaN instead of raising.
is_numeric = pd.to_numeric(df_transaction['InvoiceNo'], errors='coerce').notnull()
df_transaction_filtered = df_transaction[is_numeric].copy()  # copy to avoid SettingWithCopyWarning
print(df_transaction_filtered.shape)
Invoice_No = pd.to_numeric(df_transaction_filtered['InvoiceNo'])  # numeric version of InvoiceNo
df_transaction_filtered['Invoice_No'] = Invoice_No  # add the new column to the DataFrame
print(df_transaction_filtered.shape)
print(df_transaction_filtered.dtypes)
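
To see what `errors='coerce'` does in isolation, here is a minimal sketch on a three-row, made-up frame. The value `"C536379"` mimics the one that triggered the `ValueError`; everything else is hypothetical.

```python
import pandas as pd

# Hypothetical invoices; the last InvoiceNo cannot be parsed as a number
df = pd.DataFrame({
    "InvoiceNo": ["536365", "536366", "C536379"],
    "Quantity": [6, 6, -1],
})

as_number = pd.to_numeric(df["InvoiceNo"], errors="coerce")  # unparseable -> NaN
print(as_number)

numeric_only = df[as_number.notna()].copy()   # drop the rows that became NaN
numeric_only["Invoice_No"] = as_number[as_number.notna()].astype("int64")
print(numeric_only.dtypes)
```

Because the coerced `Series` contains `NaN`, it is a float column until the missing rows are removed; only then can it be cast to an integer type.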

### Drop Columns
Columns can be dropped either by subsetting the columns to keep or by calling the `drop` method of the DataFrame. Since subsetting has already been covered, the code snippet below uses the `drop` method.

In [None]:
print(df_transaction_filtered.columns)  # Check the columns before dropping
df_transaction_filtered_dropped = df_transaction_filtered.drop(['InvoiceNo'], axis=1) # Drop the InvoiceNo column
print(df_transaction_filtered_dropped.columns)
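
Both approaches mentioned in this section can be compared side by side. The sketch below uses a small hypothetical frame with the same column names and assumes that only `InvoiceNo` should be removed:

```python
import pandas as pd

# Hypothetical frame with both the string and the integer invoice columns
df = pd.DataFrame({
    "InvoiceNo": ["536365", "536366"],
    "Invoice_No": [536365, 536366],
    "Quantity": [6, 6],
})

# Option 1: drop by name (equivalent to df.drop(["InvoiceNo"], axis=1))
dropped = df.drop(columns=["InvoiceNo"])

# Option 2: subset the columns to keep
subset = df[["Invoice_No", "Quantity"]]

print(dropped.columns.tolist())
print(subset.columns.tolist())
```

Neither call modifies `df` itself; each returns a new DataFrame, so the result has to be assigned to a name to be kept.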

### Importing Data

Table DA_33: Functions for importing different file formats

|Function| Description|
|------|------|
| `read_csv` | Load delimited data from a file, URL, or file-like object; uses a comma as the default delimiter |
| `read_table` | Load delimited data from a file, URL, or file-like object; uses a tab (`'\t'`) as the default delimiter |
| `read_fwf` | Read data in fixed-width column format (i.e., no delimiters) |
| `read_clipboard` | Version of `read_table` that reads data from the clipboard; useful for converting tables from web pages |
| `read_excel` | Read tabular data from an Excel XLS or XLSX file |
| `read_hdf` | Read HDF5 files written by pandas |
| `read_html` | Read all tables found in the given HTML document |
| `read_json` | Read data from a JSON (JavaScript Object Notation) string representation |
| `read_msgpack` | Read pandas data encoded using the MessagePack binary format (removed in pandas 1.0) |
| `read_pickle` | Read an arbitrary object stored in Python pickle format |
| `read_sas` | Read a SAS dataset stored in one of the SAS system’s custom storage formats |
| `read_sql` | Read the results of a SQL query (using SQLAlchemy) as a pandas DataFrame |
| `read_stata` | Read a dataset from Stata file format |
| `read_feather` | Read the Feather binary file format |

#### Read function arguments

Table DA_34: Read arguments

|Argument| Description|
|------|------|
| `path` | String indicating filesystem location, URL, or file-like object |
| `sep` or `delimiter` | Character sequence or regular expression to use to split fields in each row |
| `header` | Row number to use as column names; defaults to 0 (first row), but should be `None` if there is no header row |
| `index_col` | Column numbers or names to use as the row index in the result; can be a single name/number or a list of them for a hierarchical index |
| `names` | List of column names for the result; combine with `header=None` |
| `skiprows` | Number of rows at the beginning of the file to ignore, or a list of row numbers (starting from 0) to skip |
| `na_values` | Sequence of values to replace with NA |
| `comment` | Character(s) used to split comments off the end of lines |
| `parse_dates` | Attempt to parse data to datetime; `False` by default. If `True`, will attempt to parse all columns. Otherwise, can specify a list of column numbers or names to parse. If an element of the list is a tuple or list, will combine multiple columns together and parse to date (e.g., if date/time is split across two columns) |
| `keep_date_col` | If joining columns to parse a date, keep the joined columns; `False` by default |
| `converters` | Dict mapping column numbers or names to functions (e.g., `{'foo': f}` would apply the function `f` to all values in the `'foo'` column) |
| `dayfirst` | When parsing potentially ambiguous dates, treat as international format (e.g., 6/9/2019 -> September 6, 2019); `False` by default |
| `date_parser` | Function to use to parse dates (deprecated since pandas 2.0 in favor of `date_format`) |
| `nrows` | Number of rows to read from the beginning of the file |
| `iterator` | Return a `TextParser` object for reading the file piecemeal |
| `chunksize` | For iteration, size of file chunks |
| `skipfooter` | Number of lines to ignore at the end of the file (named `skip_footer` in older pandas versions) |
| `verbose` | Print various parser output information, like the number of missing values placed in non-numeric columns |
| `encoding` | Text encoding for Unicode (e.g., `'utf-8'` for UTF-8 encoded text) |
| `squeeze` | If the parsed data contains only one column, return a `Series` (removed in pandas 2.0) |
| `thousands` | Separator for thousands (e.g., `','` or `'.'`) |
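
Several of the arguments in Table DA_34 can be combined in a single `read_csv` call. The sketch below reads from an in-memory string rather than a file; the text content, the column names, and the `"missing"` marker are all made up for illustration.

```python
import io
import pandas as pd

# Hypothetical semicolon-delimited export with no header row
text = (
    "# hypothetical export\n"
    "536365;6;01/12/2010;1,250.50\n"
    "536366;missing;01/12/2010;85.00\n"
    "536367;32;02/12/2010;9.99\n"
)

df = pd.read_csv(
    io.StringIO(text),
    sep=";",                        # field delimiter
    header=None,                    # the data has no header row
    names=["InvoiceNo", "Quantity", "InvoiceDate", "Amount"],
    comment="#",                    # ignore lines starting with '#'
    na_values=["missing"],          # treat this marker as NA
    parse_dates=["InvoiceDate"],    # parse this column to datetime
    dayfirst=True,                  # 01/12/2010 means 1 December 2010
    thousands=",",                  # 1,250.50 -> 1250.5
)
print(df)
print(df.dtypes)

# chunksize returns an iterator of DataFrames instead of one DataFrame
chunk_lengths = [
    len(chunk)
    for chunk in pd.read_csv(io.StringIO(text), sep=";", header=None,
                             comment="#", chunksize=2)
]
print(chunk_lengths)
```

Reading in chunks is mainly useful when a file is too large to hold in memory at once; each chunk can be processed and discarded before the next one is read.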

### Exporting Data

Table DA_35: Methods for exporting into different file formats

|Method| Description|
|------|------|
| `to_csv` | Save data into a CSV file |
| `to_excel` | Save data into an Excel file |
| `to_pickle` | Save data into a pickle file |
| `to_feather` | Convert data into a Feather file |
| `to_clipboard` | Save data into the system clipboard for pasting |
| `to_dense` | Convert data into a regular “dense” DataFrame (removed in pandas 1.0) |
| `to_dict` | Convert data into a Python dict |
| `to_gbq` | Convert data into a Google BigQuery table |
| `to_hdf` | Save data into a hierarchical data format (HDF) |
| `to_msgpack` | Save data into a portable JSON-like binary (removed in pandas 1.0) |
| `to_html` | Convert data into an HTML table |
| `to_json` | Convert data into a JSON string |
| `to_latex` | Convert data into a LaTeX tabular environment |
| `to_records` | Convert data into a record array |
| `to_string` | Show DataFrame as a string for stdout |
| `to_sparse` | Convert data into a SparseDataFrame (removed in pandas 1.0) |
| `to_sql` | Save data into a SQL database |
| `to_stata` | Convert data into a Stata dta file |
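
A minimal sketch of a few export methods from Table DA_35 follows. It keeps everything in memory (no files are written) and uses a small made-up frame; the variable names are illustrative.

```python
import io
import json
import pandas as pd

# Hypothetical two-row frame
df = pd.DataFrame({
    "StockCode": ["85123A", "71053"],
    "UnitPrice": [2.55, 3.39],
})

csv_text = df.to_csv(index=False)          # returns a string when no path is given
json_text = df.to_json(orient="records")   # JSON string, one object per row
records = df.to_dict(orient="records")     # list of Python dicts

# Round trip: the CSV text can be read back with read_csv
restored = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
print(json.loads(json_text))
print(restored)
```

Passing a file path as the first argument, for example `df.to_csv("some_file.csv", index=False)` with a path of your choosing, writes to disk instead of returning a string.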

**For more detail on input/output file formats, see the link$^3$.**

**References**  
$^1$ https://pandas.pydata.org/pandas-docs/stable/reference/series.html#constructor  
$^2$ https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#  
$^3$ https://pandas.pydata.org/pandas-docs/stable/reference/io.html