## Modelling Qualitative Data -- Overview

 1. Goal: learn a general workflow that you can easily implement to model qualitative data
 2. Tools: 
     - a mainstream Linux operating system
     - a dataset consisting of various text-based data
     - low-level native Unix programs to preprocess your data
     - Python programs to do your analysis
     - Python programs to lay out your results
     - Python programs to make presentations of your results

## What are qualitative data and how to model them

1. Qualitative data: mostly text-based data of various types (but not exclusively, e.g. also pictures)
    - postcards, vignettes, short texts (e.g. tweets)
    - newspaper articles
    - scientific articles
    - books
    - corpora = collections of text data, i.e. big (qualitative) data

2. Modelling techniques: techniques to understand the information content of your dataset

## Several modelling techniques -- Exclusively human-centric techniques

1. human-centric techniques:
   - based on reading the dataset and interpreting it
   - categorization of the information content in the dataset
   - structuring of the categories based on the interpretation of the dataset
   - result: analysis of this structure = reveals the hidden/latent structure of the dataset
        
     * advantages: 
       - based on the human understanding of texts = better control over the interpretation of texts
       - sensitive to the polysemy of words/texts
     * disadvantages:
       - difficult to scale results to the total amount of the investigated material -- main results often apply to 10-15% of the investigated material
       - difficult to generalize results beyond the given dataset
       - difficult to reproduce the results
       - difficult to share the results with other researchers
       - difficult to generalize the results to other sources/actors from the context the dataset comes from
       - possible interpretation bias
       - time-consuming --> often limits the scope of data that can be investigated in a given time
       - complement quantitatively oriented research designs, but are not compatible with them = parallel routes

## Several modelling techniques -- Mostly human-centric techniques

2. mostly human-centric techniques, with the help of basic computing:
   - based on reading the dataset and interpreting it
   - categorization of the information content in the dataset -- computer-driven
   - basic statistics (mostly word frequencies and word distributions)
   - structuring of the categories based on the interpretation of the dataset -- computer-driven
   - basic structuring tools (e.g. MAXQDA, NLP techniques)
   - result: analysis of this structure = reveals the hidden/latent structure of the dataset
   
     * advantages: 
       - based on the human understanding of texts, with added computer-driven facilities
       - sensitive to the polysemy of words/texts
       - better at generalizing the results beyond the given dataset than exclusively human-centric techniques
       - less time-consuming than exclusively human-centric techniques
     * disadvantages:
       - difficult to scale results to the total amount of the investigated material -- main results often apply to 10-15% of the investigated material
       - difficult to reproduce the results
       - difficult to share the results with other researchers
       - difficult to generalize the results to other sources/actors from the context the dataset comes from
       - possible interpretation bias
       - complement quantitatively oriented research designs, but are not compatible with them = parallel routes
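
The "basic statistics" step above (word frequencies and word distributions) can be sketched in a few lines of Python. This is only a minimal illustration using the standard library: the function names, the rough tokenization rule and the two sample sentences are assumptions made for this example, not part of any particular tool such as MAXQDA.

```python
import re
from collections import Counter


def tokenize(text):
    """Split a text into lowercase word tokens (a deliberately rough rule)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_frequencies(texts):
    """Count how often each word occurs across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


def document_distribution(texts):
    """Count, for each word, the number of texts it appears in at least once."""
    spread = Counter()
    for text in texts:
        # set() ensures a word is counted only once per text
        spread.update(set(tokenize(text)))
    return spread


# Hypothetical sample data, only for illustration
sample = ["The cat sat on the mat.", "The dog sat."]
print(word_frequencies(sample).most_common(3))
print(document_distribution(sample)["cat"])
```

Frequencies show which words dominate the dataset, while the distribution shows whether a word is spread over many texts or concentrated in a few. The interpretation of both remains a human task.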

## Several modelling techniques -- Partly human-centric techniques

3. partly human-centric techniques, partly computer-driven:
   - reading the dataset and processing it is computer-driven
   - categorization of the information content in the dataset -- computer-driven
   - advanced analytics using statistical or mathematical modelling methods
   - structuring of the categories based on the modelling methods
   - advanced structuring tools (e.g. R, Python)
   - result: analysis of this structure = reveals the hidden/latent structure of the dataset --> rests on human understanding of the results
   
     * advantages:
       - based on the human understanding of texts, with added advanced data analytics
       - better scaling of the results --> they apply to the total amount of the investigated material
       - better at generalizing the results beyond the given dataset than other human-centric techniques
       - less time-consuming than other human-centric techniques
       - results can easily be reproduced
       - results can easily be shared with other researchers
       - better at generalizing the results to other sources/actors from the context the dataset comes from
       - better at accumulating further data to enrich the dataset
       - better at comparing the same kind of data in different languages
       - reduces interpretation bias
       - better compatibility with quantitatively oriented research designs = converging routes
     * disadvantages:
       - less sensitive to the polysemy of words/texts (even in AI frameworks)
       - knowledge-demanding --> requires programming skills (which can be time-consuming)
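
As one possible illustration of a "statistical modelling method" applied to the whole dataset, the sketch below computes TF-IDF weights (term frequency times inverse document frequency) with the Python standard library only. TF-IDF is chosen here as a common, simple example and is an assumption of this sketch, not necessarily the method used later in this workflow; the function names and sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split a text into lowercase word tokens (a deliberately rough rule)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def tf_idf(texts):
    """Return one {word: weight} dict per text.

    weight = (count of word in text / words in text)
             * log(number of texts / number of texts containing the word)
    """
    docs = [Counter(tokenize(t)) for t in texts]
    n_docs = len(docs)
    doc_freq = Counter()
    for counts in docs:
        doc_freq.update(counts.keys())
    weights = []
    for counts in docs:
        total = sum(counts.values())
        weights.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


def top_terms(weight_dict, k=3):
    """Return the k highest-weighted words of one text."""
    return sorted(weight_dict, key=weight_dict.get, reverse=True)[:k]


# Hypothetical sample data, only for illustration
sample = ["the cat sat", "the dog barked", "the cat purred"]
for text, weights in zip(sample, tf_idf(sample)):
    print(text, "-->", top_terms(weights, k=1))
```

Because the same script processes every text in the same way, the result covers the whole dataset and can be rerun and shared, which is the reproducibility advantage listed above. Words that occur in every text (here "the") get a weight of zero, so the ranking highlights what distinguishes each text.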
       

## Right tools for the right tasks -- Operating systems

1. Why Linux?
  - open-source operating system -- easy to install and maintain at no monetary cost
  - gets the most out of dated hardware --> reuse your old computers
  - portable -- use the OS on a wide range of hardware, as well as from simple external drives or USB keys
  - mainstream software for all mainstream tasks
  - powerful software for data analytics:
    - install R and related packages from CRAN
    - Python comes preinstalled with most mainstream distributions
    - benefit from native low-level Unix utilities to tailor the dataset
    - provides free, open-source software to extend the analytic framework
    
2. Why not Windows or macOS (or *BSD)?
    - cost of the operating system and the software
    - no portability of the software to other hardware -- you have to stick with one given piece of hardware
    - Windows: no out-of-the-box tools to tailor the dataset --> limited choice of Unix tools compatible with Windows
    - macOS and *BSD flavors: some out-of-the-box tools to tailor the dataset --> not always compatible with the same Unix tools -- *BSD systems are more involved to use
    - but: you can install R (directly) and Python (e.g. with Anaconda) for data analytics
    - but: you can install free, open-source software to extend the analytic framework
    
Linux is more often used in the context of data analytics because of its practicality and scalability. Disadvantage: coming from Windows or macOS, there is a learning curve regarding:
  - the use of the command line
  - the use of software equivalent to what you have on Windows/macOS --> e.g. LibreOffice instead of Microsoft Office
  - in general: changing some of your habits

Benefit: more autonomy in your research --> you can design your work and workflow as you want, i.e. you are not limited by the OS

## Right tools for the right tasks -- Software

1. Why Python and not R?
  - both are excellent software and programming environments with a long history and a great community
  - both have a learning curve
  - R -- mainly used for statistics
  - Python -- more general approach to data science
  - R -- you use the flexibility of R libraries
  - Python -- you can write your application from scratch
  - R -- runs locally
  - Python -- better integration with apps and better deployment 

Python is often the first and obvious choice when it comes to designing machine learning frameworks -- it is easy to find support and material for your work