## Python Pandas Working With XML - Part 5

1. Read XML into a DataFrame
2. Convert a DataFrame to XML

What is XML?
1. XML stands for eXtensible Markup Language
2. XML is a markup language much like HTML
3. XML was designed to store and transport data
4. XML was designed to be self-descriptive
5. XML is a W3C Recommendation

In [4]:
import pandas as pd

In [34]:
# Read and parse the XML file "test.xml" into a pandas DataFrame.

pd.read_xml('test.xml')

      shape  degrees  sides
0    square      360    4.0
1    circle      360    NaN
2  triangle      180    3.0
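The contents of `test.xml` are not shown in this notebook. The following is a minimal sketch that assumes a file whose rows match the output above; the XML text, the temporary directory, and the name `shapes` are illustrative, not taken from the original file. It passes `parser="etree"` so that it runs with the standard library only, whereas the default parser requires `lxml`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical file contents, chosen to reproduce the output shown above.
xml_text = """<?xml version='1.0' encoding='utf-8'?>
<data>
  <row>
    <shape>square</shape>
    <degrees>360</degrees>
    <sides>4.0</sides>
  </row>
  <row>
    <shape>circle</shape>
    <degrees>360</degrees>
    <sides/>
  </row>
  <row>
    <shape>triangle</shape>
    <degrees>180</degrees>
    <sides>3.0</sides>
  </row>
</data>"""

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "test.xml"
    path.write_text(xml_text, encoding="utf-8")
    # Each <row> becomes a DataFrame row; each child element becomes a column.
    shapes = pd.read_xml(str(path), parser="etree")

print(shapes)
```

The empty `<sides/>` element is read as a missing value, which is why the `sides` column is floating point.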


In [5]:
xml = '''<?xml version='1.0' encoding='utf-8'?>
<data xmlns="http://example.com">
 <row>
   <shape>square</shape>
   <degrees>360</degrees>
   <sides>4.0</sides>
   <firstname>Krish</firstname>
 </row>
 <row>
   <shape>circle</shape>
   <degrees>360</degrees>
   <sides/>
   <firstname/>
 </row>
 <row>
   <shape>triangle</shape>
   <degrees>180</degrees>
   <sides>3.0</sides>
   <firstname/>
 </row>
</data>'''

In [6]:
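from io import StringIO  # passing a literal XML string to read_xml is deprecated since pandas 2.1
pd.read_xml(StringIO(xml))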

      shape  degrees  sides  firstname
0    square      360    4.0      Krish
1    circle      360    NaN        NaN
2  triangle      180    3.0        NaN
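The document above declares a default namespace (`xmlns="http://example.com"`). With the default `xpath` no extra setup is needed, but a custom `xpath` has to reference that namespace through a prefix. A minimal sketch, assuming a shortened copy of the XML and an arbitrary temporary prefix `tmp` (both are illustrative choices); `parser="etree"` is used only to avoid the `lxml` dependency.

```python
from io import StringIO

import pandas as pd

# Shortened, hypothetical copy of the document with a default namespace.
xml_default_ns = """<?xml version='1.0' encoding='utf-8'?>
<data xmlns="http://example.com">
  <row>
    <shape>square</shape>
    <degrees>360</degrees>
  </row>
  <row>
    <shape>circle</shape>
    <degrees>360</degrees>
  </row>
</data>"""

# "tmp" is an arbitrary prefix bound to the document's default namespace URI.
rows = pd.read_xml(
    StringIO(xml_default_ns),
    xpath=".//tmp:row",
    namespaces={"tmp": "http://example.com"},
    parser="etree",
)

print(rows)
```

The prefix only has to be consistent between `xpath` and `namespaces`; what must match the document is the URI.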


In [7]:
xml = '''<?xml version='1.0' encoding='utf-8'?>
<data>
  <row shape="square" degrees="360" sides="4.0" firstname="Krish"/>
  <row shape="circle" degrees="360"/>
  <row shape="triangle" degrees="180" sides="3.0" lastname="Naik"/>
</data>'''

In [8]:
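from io import StringIO  # literal XML strings are deprecated in read_xml since pandas 2.1
pd.read_xml(StringIO(xml), xpath=".//row")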

      shape  degrees  sides  firstname  lastname
0    square      360    4.0      Krish       NaN
1    circle      360    NaN        NaN       NaN
2  triangle      180    3.0        NaN      Naik
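By default `read_xml` collects both the attributes and the child elements of each matched node. When a document mixes the two, the `attrs_only` and `elems_only` parameters restrict parsing to one kind. A minimal sketch, assuming a small made-up document (`xml_mixed` and its `id` attribute are hypothetical) and `parser="etree"` to avoid requiring `lxml`:

```python
from io import StringIO

import pandas as pd

# Hypothetical document: every row has an attribute and a child element.
xml_mixed = """<data>
  <row id="1"><shape>square</shape></row>
  <row id="2"><shape>circle</shape></row>
</data>"""

# Default: attributes and child elements both become columns.
both = pd.read_xml(StringIO(xml_mixed), xpath=".//row", parser="etree")

# Only the attributes of each <row>.
attrs = pd.read_xml(
    StringIO(xml_mixed), xpath=".//row", attrs_only=True, parser="etree"
)

# Only the child elements of each <row>.
elems = pd.read_xml(
    StringIO(xml_mixed), xpath=".//row", elems_only=True, parser="etree"
)

print(both)
print(attrs)
print(elems)
```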


In [9]:
xml = '''<?xml version='1.0' encoding='utf-8'?>
<doc:data xmlns:doc="https://example.com">
  <doc:row>
    <doc:shape>square</doc:shape>
    <doc:degrees>360</doc:degrees>
    <doc:sides>4.0</doc:sides>
  </doc:row>
  <doc:row>
    <doc:shape>circle</doc:shape>
    <doc:degrees>360</doc:degrees>
    <doc:sides/>
  </doc:row>
  <doc:row>
    <doc:shape>triangle</doc:shape>
    <doc:degrees>180</doc:degrees>
    <doc:sides>3.0</doc:sides>
  </doc:row>
</doc:data>'''

In [10]:
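from io import StringIO  # literal XML strings are deprecated in read_xml since pandas 2.1
df = pd.read_xml(StringIO(xml), xpath=".//doc:row", namespaces={"doc": "https://example.com"})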


In [11]:
df

      shape  degrees  sides
0    square      360    4.0
1    circle      360    NaN
2  triangle      180    3.0


In [21]:
# Exploratory Data Analysis (EDA) with pandas on the XML dataset at the URL below, step by step:

# Step 1: Read the XML data into a pandas DataFrame
url = "https://gist.githubusercontent.com/asascience-deploy/353ff62205118043d215/raw/cf284998e814ee07037a7a8f3ad1c7f7ea904aab/datasets.xml"
df = pd.read_xml(url)


In [13]:
df

Unnamed: 0,type,datasetID,active,reloadEveryNMinutes,fileDir,recursive,fileNameRegex,metadataFrom,preExtractRegex,postExtractRegex,extractRegex,columnNameForExtract,sortedColumnSourceName,sortFilesBySourceNames,fileTableInMemory,addAttributes,dataVariable
0,EDDTableFromNcFiles,test_2788_304d_47c8,True,5,/Users/lcampbell/data/ooi/test/,True,.*\.nc,last,,,,,sci_m_present_time,sci_m_present_time,False,,


In [17]:
# Step 2: Explore basic information about the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 17 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   type                    1 non-null      object 
 1   datasetID               1 non-null      object 
 2   active                  1 non-null      object 
 3   reloadEveryNMinutes     1 non-null      float64
 4   fileDir                 1 non-null      object 
 5   recursive               1 non-null      object 
 6   fileNameRegex           1 non-null      object 
 7   metadataFrom            1 non-null      object 
 8   preExtractRegex         0 non-null      float64
 9   postExtractRegex        0 non-null      float64
 10  extractRegex            0 non-null      float64
 11  columnNameForExtract    0 non-null      float64
 12  sortedColumnSourceName  1 non-null      object 
 13  sortFilesBySourceNames  1 non-null      object 
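 14  fileTableInMemory       1 non-null      object 
 15  addAttributes           0 non-null      float64
 16  dataVariable            0 non-null      float64
dtypes: float64(7), object(10)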

In [16]:
# Step 3: Display the first few rows of the DataFrame
df.head()

Unnamed: 0,type,datasetID,active,reloadEveryNMinutes,fileDir,recursive,fileNameRegex,metadataFrom,preExtractRegex,postExtractRegex,extractRegex,columnNameForExtract,sortedColumnSourceName,sortFilesBySourceNames,fileTableInMemory,addAttributes,dataVariable
0,,,,,,,,,,,,,,,,,
1,EDDTableFromNcFiles,test_2788_304d_47c8,True,5.0,/Users/lcampbell/data/ooi/test/,True,.*\.nc,last,,,,,sci_m_present_time,sci_m_present_time,False,,


In [20]:
# Step 4: Analyze summary statistics of numerical columns

df.describe()

Unnamed: 0,reloadEveryNMinutes,preExtractRegex,postExtractRegex,extractRegex,columnNameForExtract,addAttributes,dataVariable
count,1.0,0.0,0.0,0.0,0.0,0.0,0.0
mean,5.0,,,,,,
std,,,,,,,
min,5.0,,,,,,
25%,5.0,,,,,,
50%,5.0,,,,,,
75%,5.0,,,,,,
max,5.0,,,,,,
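`describe()` summarises only the numeric columns by default, which is why the text columns are absent from the table above. A minimal sketch of summarising the non-numeric columns as well, assuming a small stand-in frame (`frame` and its two columns are a hypothetical reduction of the dataset, not the real data):

```python
import pandas as pd

# Hypothetical two-column stand-in: one empty row, one populated row.
frame = pd.DataFrame(
    {
        "type": [None, "EDDTableFromNcFiles"],
        "reloadEveryNMinutes": [None, 5.0],
    }
)

# Default behaviour: numeric columns only.
numeric_summary = frame.describe()

# Everything that is not numeric: count, unique, top, freq.
text_summary = frame.describe(exclude="number")

print(numeric_summary)
print(text_summary)
```

With a single non-null value, `std` is `NaN`, which matches the `std` row in the output above.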


In [24]:
# Step 5: Check for missing values

df.isnull().sum()

type                      1
datasetID                 1
active                    1
reloadEveryNMinutes       1
fileDir                   1
recursive                 1
fileNameRegex             1
metadataFrom              1
preExtractRegex           2
postExtractRegex          2
extractRegex              2
columnNameForExtract      2
sortedColumnSourceName    1
sortFilesBySourceNames    1
fileTableInMemory         1
addAttributes             2
dataVariable              2
dtype: int64
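The counts above suggest one row that is entirely empty and several columns with no values at all. One possible follow-up, shown here as a sketch on a hypothetical stand-in frame rather than on the real dataset, is to drop rows and columns that contain nothing but missing values with `dropna(how="all")`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in: row 0 is entirely empty, "extractRegex" has no values.
frame = pd.DataFrame(
    {
        "datasetID": [None, "test_dataset"],
        "reloadEveryNMinutes": [np.nan, 5.0],
        "extractRegex": [np.nan, np.nan],
    }
)

# Missing values per column, as in Step 5.
missing_per_column = frame.isnull().sum()

# Drop rows that are entirely empty, then columns that are entirely empty.
cleaned = frame.dropna(how="all").dropna(axis="columns", how="all")

print(missing_per_column)
print(cleaned)
```

Whether dropping is appropriate depends on the analysis; empty columns may still be worth keeping as documentation of the schema.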

In [52]:
# Write the DataFrame to an XML file

df.to_xml('test1.xml')
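When no path is given, `to_xml` returns the XML as a string, which makes it easy to inspect the result and read it back. A minimal round-trip sketch, assuming a small illustrative frame (`shapes`, and the `root_name`/`row_name` values, are arbitrary choices) and `parser="etree"` so that `lxml` is not required:

```python
from io import StringIO

import pandas as pd

# Illustrative frame modelled on the shapes example earlier in the notebook.
shapes = pd.DataFrame(
    {
        "shape": ["square", "circle", "triangle"],
        "degrees": [360, 360, 180],
        "sides": [4.0, None, 3.0],
    }
)

# index=False omits the <index> element that to_xml writes by default.
xml_out = shapes.to_xml(
    index=False, root_name="shapes", row_name="item", parser="etree"
)
print(xml_out)

# Read the generated XML back into a DataFrame.
round_trip = pd.read_xml(StringIO(xml_out), parser="etree")
print(round_trip)
```

Leaving `index=True` (the default) would add an `index` column on the way back in, so `index=False` is the simpler choice when the index carries no information.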