# Working with Data - Pandas

`Pandas` is a popular Python library for data manipulation and analysis. It is commonly used for data cleaning, data wrangling, and statistical analysis, and for preparing data for tasks such as building machine learning models and data visualization.

Pandas is well suited to many kinds of data, including:

- Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet.
- Ordered and unordered (not necessarily fixed-frequency) time series data.
- Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels.
- Any other form of observational or statistical data set. The data need not be labeled at all to be placed into a pandas data structure.

##### Series and Dataframes

The `Series` and the `DataFrame` are the two core data structures in Pandas.

- A pandas `Series` is a one-dimensional labeled array that can hold any data type. Think of a series as a single column of a spreadsheet.

- A pandas `DataFrame` is a two-dimensional labeled data structure whose columns can have different data types. Think of a dataframe as a spreadsheet with multiple rows and columns.

<img src="images/2.png" width="500">
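
The two structures can be sketched with a few lines of code. This is a minimal illustration only; the fruit names, quantities, and costs are made-up sample values, not data from the lab.

```python
import pandas

# A Series: one labeled column of values (hypothetical sample data).
quantities = pandas.Series([3, 5, 2], index=["apples", "pears", "plums"], name="Quantity")

# A DataFrame: several columns, each of which may have its own data type.
orders = pandas.DataFrame({
    "Name": ["apples", "pears", "plums"],
    "Quantity": [3, 5, 2],
    "Item Cost": [0.5, 0.75, 1.2],
})

print(quantities["pears"])        # look up a value by its label
print(orders.dtypes)              # one dtype per column
print(type(orders["Quantity"]))   # each column of a DataFrame is a Series
```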

##### Avoiding Loops and Iterations
Pandas has powerful functions and methods that operate on entire arrays of data, allowing you to perform operations on large datasets without the need for explicit loops or iterations. These functions, such as aggregation, filtering, and transformation, are designed to efficiently process data in a vectorized manner, resulting in faster and more concise code. By leveraging these capabilities, you can avoid the performance overhead and potential errors that can arise from manual looping and iteration over data.

You'll see this capability in the section under the heading `Select a subset of columns and filter to leave only the required cities`. Notice that you don't need to loop through the dataframe to check if each row matches. You can perform the operation on the entire dataframe with a single line of code!
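
As a rough sketch of the difference, the example below computes a total cost per row twice: once with an explicit loop and once with a single vectorized expression. The column names are borrowed from the lab's spreadsheet, but the values are invented for illustration.

```python
import pandas

orders = pandas.DataFrame({
    "Quantity": [3, 5, 2],
    "Item Cost": [0.5, 0.75, 1.25],
})

# Loop version: visit every row and compute the total one at a time.
totals_loop = []
for _, row in orders.iterrows():
    totals_loop.append(row["Quantity"] * row["Item Cost"])

# Vectorized version: multiply the two columns as whole arrays.
orders["Total Cost"] = orders["Quantity"] * orders["Item Cost"]

print(orders)
print(totals_loop == orders["Total Cost"].tolist())
```

Both approaches produce the same numbers; the vectorized form is shorter and avoids the per-row Python overhead.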

##### Why are there double square brackets? `[[ ]]`

In some cases you may see data being accessed using double square brackets. For example, `things = df[['Name','Quantity']]`. If you access data from a dataframe just using a single column name you will get back a series.

`names = df['Name']`

<img src="images/3.png" width="500">

To get back a dataframe containing several columns, you need to pass in a list of column names. You can either store the list of names in a separate variable and pass in that variable, or pass the list directly inside the indexing brackets, which is where the double square brackets come from. See the following picture for an example.

<img src="images/4.png" width="500">
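
The same behaviour can be checked in code. The sketch below uses a small made-up dataframe; only the `Name` and `Quantity` column names come from the text above, and the `City` column and all values are assumptions for illustration.

```python
import pandas

df = pandas.DataFrame({
    "Name": ["apples", "pears"],
    "Quantity": [3, 5],
    "City": ["Oslo", "Lima"],
})

names = df["Name"]                      # single column name -> Series
things = df[["Name", "Quantity"]]       # list of names -> DataFrame

columns_to_keep = ["Name", "Quantity"]  # the same list stored in a variable first
same_things = df[columns_to_keep]

one_column_frame = df[["Name"]]         # a one-item list still returns a DataFrame

print(type(names).__name__, type(things).__name__, type(one_column_frame).__name__)
```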

##### Starting the lab

The first section of this lab demonstrates a use case I encounter very often: working with large Excel spreadsheets. Before you run the lab, make sure you have generated some sample data using the `faker` Python library in `generate_sample_data.ipynb`.

See the following documentation to learn more about pandas:

[https://pandas.pydata.org/](https://pandas.pydata.org/)

##### Import the required libraries

In [None]:
import json
import openpyxl
import pandas

##### Read in sample data from an Excel file

In [None]:
RAW_DATA = pandas.read_excel("sample_data.xlsb", sheet_name='Sheet1', header=0, engine='pyxlsb')

##### Print the first 10 rows

In [None]:
RAW_DATA.head(10)

##### Print the column names and observe the memory usage

In [None]:
RAW_DATA.info()

##### Select a subset of columns and filter to leave only the required cities

In [None]:
STILL_RAW_DATA = RAW_DATA[['Customer ID','Customer Name','City','Order Year','Order Date','Order ID','Store ID',
                            'Product BU','Product Type','Product','Quantity','Item Cost','Total Cost']]

LIST_OF_CITIES_TO_KEEP = ['Vaughnmouth','North Margaretbury','Melanieland','Andreamouth','Aprilfort','South Heatherborough','West Crystal','Lake Andreabury',
                        'Buchanantown','Stephenmouth','Johnberg','East Maryberg','East Kellyview','Mcdanielstad','Francisburgh','Bensonborough','South Jason',
                        'North Jeffreyborough','Gonzalezmouth','Port Joshua','New Anthony','Lake Matthew','Carrilloborough','Cunninghambury','Williamshire',
                        'Jennifershire','Caseyland','Emilyland','Port Amberberg','Kathleenside','Lewisside','South Michaelland','North Sarahborough','Mariaburgh',
                        'West Amanda','Kathleenbury','Michaelshire','Archerview','Randolphtown','Grantfort','Port Lisa','South Scottville','East Edward','West Judithland',
                        'Port Gabriel','Grossview','Port Michaelside','Wardbury','North Brian','North Amy','Port Joseph','South Debrafort','Port Ericafort','Martinton','West Nicole',
                        'Lesliefurt','South Michaelville','East Ericport','Deanmouth','Port Jade','Lake Andrewmouth','Stephanieland','Zacharyshire','Sheilaville','Robinsonbury',
                        'Lake Chad','Lake Jessica','Lindsayfurt','Port Samantha','Port Michele','West Paul','Kimberlychester','Elizabethburgh','Jasonville','Webbhaven','Kristenchester',
                        'Taylorfort','Johnborough','West Sean','East Brian','South Connortown','Millermouth','Morenoside','Maryhaven','West Wesley','Martinfurt','Alvaradomouth',
                        'East Annton','Shepardhaven','Douglasmouth','Phillipland','Juliaside','East Holly','Carterport','New Lindsey','North Jacqueline','North Tyler',
                        'East Roger','Marshallfort','Pattersonmouth']

filtered_data = STILL_RAW_DATA.loc[STILL_RAW_DATA['City'].isin(LIST_OF_CITIES_TO_KEEP)]
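
To see what the filter does without the full spreadsheet, here is a small sketch of the same `isin` and `.loc` pattern. The dataframe is invented for illustration; `Elsewhere` is a made-up city that is deliberately left out of the keep list.

```python
import pandas

sales = pandas.DataFrame({
    "City": ["Johnberg", "Port Lisa", "Elsewhere", "Johnberg"],
    "Quantity": [4, 1, 7, 2],
})
cities_to_keep = ["Johnberg", "Port Lisa"]

# isin() returns one True/False value per row ...
mask = sales["City"].isin(cities_to_keep)
print(mask.tolist())

# ... and .loc keeps only the rows where the mask is True.
kept = sales.loc[mask]
print(kept)
```

No loop is needed: the comparison against the list is applied to the whole `City` column at once.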

##### Show the columns of the new dataframe and observe the memory usage

In [None]:
filtered_data.info()

##### Print 10 rows

In [None]:
filtered_data.head(10)

##### Group some of the columns and sum the quantities

In [None]:
cleaned_data = filtered_data.groupby(['City','Order Year','Product'])["Quantity"].sum().reset_index()
cleaned_data.info()
cleaned_data.head(10)
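
The grouping step can be easier to follow on a handful of rows. The sketch below uses the same column names as the lab, with invented products and quantities, and splits the chained call into two steps so each intermediate result can be inspected.

```python
import pandas

sales = pandas.DataFrame({
    "City": ["Johnberg", "Johnberg", "Johnberg", "Port Lisa"],
    "Order Year": [2022, 2022, 2023, 2022],
    "Product": ["Widget", "Widget", "Widget", "Gadget"],
    "Quantity": [4, 2, 5, 1],
})

# One output row per unique (City, Order Year, Product) combination.
grouped = sales.groupby(["City", "Order Year", "Product"])["Quantity"].sum()
print(grouped)                 # a Series indexed by the three grouping keys

# reset_index() turns the grouping keys back into ordinary columns.
summary = grouped.reset_index()
print(summary)
```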

##### Export the results to Excel

In [None]:
cleaned_data.to_excel('tmp_subset_of_data.xlsx', index=False)

##### Open `tmp_subset_of_data.xlsx` and confirm the data has been written. Observe the file size

##### Cleanup by removing the Excel file

In [None]:
import os
if os.path.exists("tmp_subset_of_data.xlsx"):
    os.remove("tmp_subset_of_data.xlsx")

`Pandas` can work with many data formats. For example, you may want to transform nested JSON received through an API call.

##### Working with Pandas and JSON received from API calls

In [1]:
# This activity runs the "show interface brief" command against a Nexus 9000 API

import requests
import json
import pandas

# verify=False below triggers an InsecureRequestWarning on every request; silence it
import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

headers = {"content-type": "application/json"}


payload={
  "ins_api":{
    "version": "1.0",
    "type": "cli_show",
    "chunk": "0",
    "sid": "1",
    "input": "show interface brief",
    "output_format": "json"
    }
}

url = "https://sbx-nxos-mgmt.cisco.com/ins"
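reply = requests.post(url, data=json.dumps(payload), headers=headers, auth=("admin", "Admin_1234!"), verify=False, timeout=30)
reply.raise_for_status()  # surface HTTP errors before attempting to parse the body as JSON
response = reply.json()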

interfaces = pandas.json_normalize(response["ins_api"]["outputs"]["output"]["body"]["TABLE_interface"]["ROW_interface"])

interfaces.info()  # info() prints its summary itself and returns None, so no print() is needed
print(interfaces.head(10))

JSONDecodeError: Expecting value: line 1 column 1 (char 0)
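
Because the call above depends on the sandbox being reachable, the flattening step can also be tried offline. The sketch below is a hand-written stand-in for a parsed reply: the nesting follows the lookup path used in the cell above, but the field names inside each row (`interface`, `state`, `counters`) and their values are invented for illustration and may not match what the device actually returns.

```python
import pandas

# Hand-written stand-in for a parsed API reply. The key layout mirrors the
# lookup used in the cell above; the field names and values are invented.
sample_response = {
    "ins_api": {
        "outputs": {
            "output": {
                "body": {
                    "TABLE_interface": {
                        "ROW_interface": [
                            {"interface": "Ethernet1/1", "state": "up",
                             "counters": {"rx": 10, "tx": 20}},
                            {"interface": "Ethernet1/2", "state": "down",
                             "counters": {"rx": 0, "tx": 0}},
                        ]
                    }
                }
            }
        }
    }
}

rows = sample_response["ins_api"]["outputs"]["output"]["body"]["TABLE_interface"]["ROW_interface"]

# Each dictionary becomes a row; nested keys are flattened into dotted column names.
interfaces = pandas.json_normalize(rows)
print(interfaces.columns.tolist())
print(interfaces)
```

Each dictionary in the list becomes one row, and the nested `counters` dictionary is expanded into the columns `counters.rx` and `counters.tx`.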

In [None]:
# This activity runs the "show ip interface eth1/1" command against a Nexus 9000 API

import requests
import json
import pandas

# verify=False below triggers an InsecureRequestWarning on every request; silence it
import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

headers = {"content-type": "application/json"}

payload={
  "ins_api":{
    "version": "1.0",
    "type": "cli_show",
    "chunk": "0",
    "sid": "1",
    "input": "show ip interface eth1/1",
    "output_format": "json"
    }
}

url = "https://sbx-nxos-mgmt.cisco.com/ins"
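reply = requests.post(url, data=json.dumps(payload), headers=headers, auth=("admin", "Admin_1234!"), verify=False, timeout=30)
reply.raise_for_status()  # surface HTTP errors before attempting to parse the body as JSON
response = reply.json()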

interface_eth_1_1 = pandas.json_normalize(response["ins_api"]["outputs"]["output"]["body"]["TABLE_intf"]["ROW_intf"])

interface_eth_1_1.info()  # info() prints its summary itself and returns None, so no print() is needed
print(interface_eth_1_1.head(10))

# In Jupyter, the value of the last expression in a cell is displayed automatically, so you can omit print()
interface_eth_1_1.head(10)