# Data Process

In this file, I will focus on the way the data is handled.

This document was inspired by the mathematical modeling competition that ended a while ago. The raw data was stored in a variety of formats, and turning the code for reading and processing it into templates helps me categorize the methods.

This document uses Python. Versions in other languages (e.g. MATLAB) will be given in other documents.

## 1 Read data

### 1.1 Read data from txt

In [None]:
# Read data in txt files in batches.
import os
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)  # show all columns
pd.set_option('display.max_rows', 100)  # show up to 100 rows
pd.set_option('display.width', 5000)  # set the display width to avoid text wrapping

def read_txtdata_batches(dir_name, data_title):
    """Read data from txt files in batch"""

    # set file's path
    path = "E:\\ProgramNew\\Pfile_jupyter\\Module21\\UBW_data"
    datapath = os.path.join(path, dir_name, "")
    # print(datapath)

    files = os.listdir(datapath)    # Get all names in this folder
    print("Total file number: " + str(len(files)))
    txts = []
    txts.append(data_title)
    # i = 0

    for file in files:
        position = os.path.join(datapath, file)  # Construct an absolute path
        # Get the index for the file
        file_index = file.split('.')[0]

        # Read file
        with open(position, "r", encoding='UTF-8') as f:
            lines = f.readlines()
            # if i == 0:
            #     line_title = lines[0]
            #     line_title = line_title.split(':')
            #     line_title.append('FileIndex')
            #     txts.append(line_title)

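            # Skip the first (header) line, split each record on ':' and tag it with the file index
            for line in lines[1:]:
                line = line.strip('\n')
                line_split = line.split(':')
                line_split.append(file_index)
                txts.append(line_split)
            # i = i + 1
            # No explicit f.close() is needed: the with statement closes the file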

    return txts
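
The reader above hard-codes its base directory. As a minimal sketch, assuming the same layout (one header line followed by colon-separated records), the folder can instead be passed in as an argument. The name `read_txt_records` and the temporary demo folder are hypothetical and only serve the illustration. Restricting the scan to `*.txt` files and skipping blank lines are also assumptions, not something the original function does.

```python
import tempfile
from pathlib import Path

def read_txt_records(folder, data_title, sep=":"):
    """Hypothetical variant of read_txtdata_batches that takes the folder as an argument."""
    rows = [list(data_title)]
    for txt_file in sorted(Path(folder).glob("*.txt")):
        file_index = txt_file.name.split(".")[0]  # same index rule as the original function
        with open(txt_file, "r", encoding="utf-8") as f:
            lines = f.read().splitlines()
        for line in lines[1:]:  # the first line is a header, not a record
            if line.strip():  # assumption: blank lines carry no data
                rows.append(line.split(sep) + [file_index])
    return rows

# Demo on a throwaway folder so the sketch runs anywhere
title = ["T", "Time", "RangeReport", "TagID", "AnchorID", "Dis", "DisCheck",
         "DataSerialNum", "DataNum", "FileIndex"]
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "1.txt").write_text(
        "T:090531087:header line\n"
        "T:090531088:RR:0:0:760:760:229:3301\n"
        "T:090531088:RR:0:1:4550:4550:229:3301\n",
        encoding="utf-8",
    )
    rows = read_txt_records(tmp, title)

print(len(rows))  # title row plus the two records
```

Passing the folder in keeps the function usable on another machine without editing its body.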

### 1.2 Write data to CSV

In [None]:
# Construct a CSV file
import csv
import codecs

def data_write_csv(file_name, datas):
#     file_csv = codecs.open(file_name, 'w', 'utf-8')
#     writer = csv.writer(file_csv, delimiter=' ', quotechar=' ', quoting=csv.QUOTE_MINIMAL)
#     for data in datas:
#         writer.writerow(data)
#     print("File saved, processing end!")

    # newline='' stops the csv module from writing extra blank lines on Windows
    with open(file_name, 'w', encoding='utf-8', newline='') as f:
        writer = csv.writer(f)
        for data in datas:
            writer.writerow(data)
    print("File saved, processing end!")

In [None]:
# Example using the functions read_txtdata_batches and data_write_csv:
data_txt = []
original_data_title = ["T", "Time", "RangeReport", "TagID", "AnchorID", "Dis", "DisCheck", "DataSerialNum", "DataNum", "FileIndex"]
data_txt = read_txtdata_batches("正常数据", original_data_title)
print(len(data_txt))

data_write_csv('normal.csv', data_txt)
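
When the saved file is loaded again, pandas infers numeric types by default, which would turn a `Time` value such as `090531088` into an integer and lose the leading zero. A minimal sketch, assuming the file is read back with `pandas.read_csv`: passing `dtype=str` keeps every field as text so the conversion can be done later, column by column. The temporary file name below is hypothetical.

```python
import csv
import os
import tempfile

import pandas as pd

rows = [
    ["T", "Time", "RangeReport", "TagID", "AnchorID", "Dis", "DisCheck",
     "DataSerialNum", "DataNum", "FileIndex"],
    ["T", "090531088", "RR", "0", "0", "760", "760", "229", "3301", "1"],
    ["T", "090531088", "RR", "0", "1", "4550", "4550", "229", "3301", "1"],
]

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "demo.csv")  # hypothetical file name
    with open(csv_path, "w", encoding="utf-8", newline="") as f:
        csv.writer(f).writerows(rows)

    as_text = pd.read_csv(csv_path, dtype=str)  # every column stays a string
    inferred = pd.read_csv(csv_path)            # pandas guesses the types

print(as_text.loc[0, "Time"])   # leading zero preserved
print(inferred.loc[0, "Time"])  # parsed as an integer
```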

The data format is as follows:

T:090531087:DecaRangeRTLS:LogFile:z?m????:Conf:Tag0:1:Chan2

T:090531088:RR:0:0:760:760:229:3301

T:090531088:RR:0:1:4550:4550:229:3301

T:090531088:RR:0:2:4550:4550:229:3301

T:090531088:RR:0:3:6300:6300:229:3301

T:090531296:RR:0:0:760:760:230:3302

T:090531296:RR:0:1:4550:4550:230:3302

T:090531296:RR:0:2:4550:4550:230:3302

T:090531296:RR:0:3:6300:6300:230:3302

T:090531513:RR:0:0:770:770:231:3303

T:090531513:RR:0:1:4550:4550:231:3303

T:090531513:RR:0:2:4550:4550:231:3303

T:090531513:RR:0:3:6300:6300:231:3303

T:090531711:RR:0:0:780:780:232:3304

T:090531711:RR:0:1:4550:4550:232:3304

T:090531712:RR:0:2:4550:4550:232:3304
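
Read against the column titles passed to `read_txtdata_batches`, each `RR` line splits into nine colon-separated fields. A minimal sketch of that mapping follows. `parse_record` is a hypothetical helper, and treating any line whose third field is not `RR` as a non-record (such as the configuration line at the top of the file) is an assumption based on the sample above.

```python
FIELDS = ["T", "Time", "RangeReport", "TagID", "AnchorID", "Dis", "DisCheck",
          "DataSerialNum", "DataNum"]

def parse_record(line):
    """Map one colon-separated line to a dict, or return None if it is not an RR record."""
    parts = line.strip().split(":")
    if len(parts) != len(FIELDS) or parts[2] != "RR":
        return None
    return dict(zip(FIELDS, parts))

header = "T:090531087:DecaRangeRTLS:LogFile:z?m????:Conf:Tag0:1:Chan2"
sample = "T:090531088:RR:0:1:4550:4550:229:3301"

record = parse_record(sample)
print(record)
print(parse_record(header))  # the configuration line is rejected
```

The values stay as strings here; numeric conversion is left to the later DataFrame steps.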

## 2 Show data information

### 2.1 Use DataFrame object

In this section, I will use some useful attributes and methods of the DataFrame object to show data information.

I mainly use the following:

1. DataFrame.head()

2. DataFrame.shape

3. DataFrame.describe()

4. DataFrame.info()

In [None]:
pd_data_txt = pd.DataFrame(data_txt[1:], columns=data_txt[0])
print("Head entries of data:")
print(pd_data_txt.head())
print(60 * '#')
print("The shape of data:")
print(pd_data_txt.shape)
print(60 * '#')
print("Summary statistics of data:")
print(pd_data_txt.describe())
print(60 * '#')
print("Information for data:")
pd_data_txt.info()  # info() prints its report directly and returns None

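Construct a table that shows the unique-value and missing-value information for each column.

**Note:**
1. A column whose nunique value is 1 carries no information: every value is the same, so it has no discriminative power and should be dropped.
2. A variable with a very large share of missing values, for example 95% or more, can also be considered for removal.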

In [None]:
# Show some useful information about the data
def show_unique_miss(mydata):
    stats = []
    for col in mydata.columns:
        stats.append((col, mydata[col].nunique(),
                     mydata[col].isnull().sum() * 100/mydata.shape[0],
                     mydata[col].value_counts(normalize=True, dropna=False).values[0] * 100,
                     mydata[col].dtype))
    # Build the summary table once, after all columns have been collected
    stats_df = pd.DataFrame(stats, columns=['Feature', 'Unique_values', 'Percentage of missing values',
                                           'Percentage of values in the biggest category', 'type'])

    # Optional: bar chart of the missing-value counts per variable
    # missing = mydata.isnull().sum()
    # missing = missing[missing > 0]
    # missing.sort_values(inplace=True)
    # missing.plot.bar()

    return stats_df.sort_values('Percentage of missing values', ascending=False)
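
The two rules in the note above can be turned into a small filter. This is a minimal sketch, not part of the original workflow: `drop_uninformative_columns` and its `max_missing_pct` parameter are hypothetical names, and the 95% default is simply the figure quoted in the note.

```python
import numpy as np
import pandas as pd

def drop_uninformative_columns(df, max_missing_pct=95.0):
    """Drop constant columns and columns that are almost entirely missing."""
    # Rule 1: at most one distinct (non-missing) value
    constant = [c for c in df.columns if df[c].nunique() <= 1]
    # Rule 2: share of missing values at or above the threshold
    missing_pct = df.isnull().mean() * 100
    mostly_missing = [c for c in df.columns if missing_pct[c] >= max_missing_pct]
    to_drop = sorted(set(constant) | set(mostly_missing))
    return df.drop(columns=to_drop), to_drop

demo = pd.DataFrame({
    "T": ["T", "T", "T", "T"],                   # one unique value
    "Dis": [760, 4550, 4550, 6300],              # informative
    "Empty": [np.nan, np.nan, np.nan, np.nan],   # 100% missing
})
cleaned, dropped = drop_uninformative_columns(demo)
print(dropped)
print(list(cleaned.columns))
```

Returning the list of dropped names alongside the cleaned frame makes it easy to review the decision before committing to it.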

## 3 Data Mining

For continuous variables, use univariate visualization to view the distribution of the observed values.

In [None]:
import seaborn as sns
import matplotlib as mpl
from matplotlib import pyplot as plt

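def show_distribution_data(pd_data):
    # Convert columns to numeric where possible; columns that cannot be parsed are left unchanged
    pd_data = pd_data.apply(pd.to_numeric, errors='ignore')
    # print(pd_data.info())
    # Keep only the numeric columns
    df_num = pd_data.select_dtypes(include=['float64', 'int64'])
    # Plot the 2nd through 9th numeric columns, one histogram per column
    df_num = df_num[df_num.columns.tolist()[1:9]]
    df_num.hist(figsize=(16, 20), bins=50, xlabelsize=8, ylabelsize=8)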
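
Because the records are read as text, every column of the DataFrame starts out with the `object` dtype, so the conversion step matters before plotting. As far as I know, recent pandas releases mark `errors='ignore'` in `pd.to_numeric` as deprecated. A minimal sketch of an equivalent conversion that does not rely on it is given below; `to_numeric_where_possible` is a hypothetical helper name.

```python
import pandas as pd

def to_numeric_where_possible(df):
    """Convert each column to a numeric dtype if every value parses; otherwise keep it."""
    converted = {}
    for col in df.columns:
        try:
            converted[col] = pd.to_numeric(df[col])
        except (ValueError, TypeError):
            converted[col] = df[col]  # leave non-numeric columns unchanged
    return pd.DataFrame(converted, index=df.index)

demo = pd.DataFrame({
    "RangeReport": ["RR", "RR", "RR"],
    "AnchorID": ["0", "1", "2"],
    "Dis": ["760", "4550", "4550"],
})
numeric_demo = to_numeric_where_possible(demo)
numeric_cols = numeric_demo.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```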

Feature correlation analysis.

In [None]:
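import seaborn as sns
import matplotlib as mpl
from matplotlib import pyplot as plt

def show_correlation(pd_data):
    # Correlate numeric columns only; text columns would raise an error in recent pandas versions
    corrmat = pd_data.corr(numeric_only=True)
    f, ax = plt.subplots(figsize=(20, 9))
    sns.heatmap(corrmat, vmax=0.8, square=True)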
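
A heatmap is easy to scan but hard to quote. As a complement, here is a minimal sketch that lists the column pairs whose absolute correlation reaches a chosen threshold. `strongest_pairs` and its `threshold` default are hypothetical, and the demo values are taken from the sample records shown earlier.

```python
import pandas as pd

def strongest_pairs(df, threshold=0.8):
    """Return (col_a, col_b, r) for column pairs with |r| >= threshold, strongest first."""
    corr = df.corr(numeric_only=True)
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, so each pair appears once
            value = corr.iloc[i, j]
            if abs(value) >= threshold:    # NaN correlations fail this test and are skipped
                pairs.append((cols[i], cols[j], float(value)))
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)

demo = pd.DataFrame({
    "Dis": [760, 4550, 4550, 6300],
    "DisCheck": [760, 4550, 4550, 6300],
    "AnchorID": [0, 1, 2, 3],
})
pairs = strongest_pairs(demo)
for col_a, col_b, r in pairs:
    print(col_a, col_b, round(r, 3))
```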